CS4132 Data Analytics

Popularity of Tourism in Various Countries before the COVID-19 Pandemic by Lee Care Greene

Important Note: Please keep your report concise and relevant (i.e. show only relevant steps and visualizations used to answer your research questions).

Table of Contents (with relevant hyperlinks to sections)

  • Motivation & Background
  • Summary of Research Questions & Results
  • Dataset
  • Methodology
    • Data Acquisition
    • Data Cleaning
  • EDA
    • 1. Which countries have the highest number of outbound tourism, inbound tourism, and domestic tourism?
    • 2. What is the country with the most visits (e.g. sum of inbound tourism and domestic tourism)?
    • 3. What are the countries with the most tourism in every region?
    • 4. Which country has the most tourism overall (e.g. sum of outbound tourism, inbound tourism, and domestic tourism)?
    • 5. How did tourism grow across various countries?
    • 6. Are there any correlations between tourism of countries / regions and other variables (heritage, interest over time, etc.)?
  • Results Findings & Conclusion
    • 1. Which countries have the highest number of outbound tourism, inbound tourism, and domestic tourism?
    • 2. What is the country with the most visits (e.g. sum of inbound tourism and domestic tourism)?
    • 3. What are the countries with the most tourism in every region?
    • 4. Which country has the most tourism overall (e.g. sum of outbound tourism, inbound tourism, and domestic tourism)?
    • 5. How did tourism grow across various countries?
    • 6. Are there any correlations between tourism of countries / regions and other variables (heritage, interest over time, etc.)?
  • Recommendations or Further Works
  • References

Motivation and Background

Give an overview of the project, motivation, background and goals.

Tourism has long been a major form of leisure, allowing people to relax and enjoy themselves. It is also an important source of income and employment for both developed and developing countries. Unfortunately, the COVID-19 pandemic disrupted this sector immensely, and the tourism industry plummeted.

Tourism can be regarded as a social, cultural and economic phenomenon related to the movement of people outside their usual place of residence. The main types of tourism are domestic, inbound, and outbound tourism. Domestic tourism comprises the activities of a resident visitor within the country of reference. Inbound tourism comprises the activities of a non-resident visitor within the country of reference. Finally, outbound tourism comprises the activities of a resident visitor outside the country of reference.

In this project, I will be analysing the number of departures and arrivals. Through this, I aim to determine each country's popularity in outbound, inbound, and domestic tourism, which in turn reflects the popularity and reputation of different countries abroad. I will only analyse data from 1995 to 2019, before the COVID-19 pandemic. I will also look for correlations between this popularity and other variables.

Summary of Research Questions & Results

Repeat your research questions in a numbered list. After each research question, clearly state the answer/conclusion you determined. Do not give details or justifications yet — just the answer.
  1. Which countries have the highest number of outbound tourism, inbound tourism, and domestic tourism?
    United States of America, France, and India have the highest number of outbound, inbound, and domestic tourism respectively.
  2. What is the country with the most visits (e.g. sum of inbound tourism and domestic tourism)?
    China is the most visited country overall.
  3. What are the countries with the most tourism in every region?
    Outbound Tourism: Africa: Namibia, Americas: United States of America, Asia: China, Europe: Germany, Oceania: Australia
    Inbound Tourism: Africa: Namibia, Americas: United States of America, Asia: China, Europe: France, Oceania: Australia
    Domestic Tourism: Africa: South Africa, Americas: United States of America, Asia: China, Europe: United Kingdom, Oceania: Australia
    Arrivals: Africa: South Africa, Americas: United States of America, Asia: China, Europe: United Kingdom, Oceania: Australia
    Total: Africa: Namibia, Americas: United States of America, Asia: China, Europe: United Kingdom, Oceania: Australia
  4. Which country has the most tourism overall (e.g. sum of outbound tourism, inbound tourism, and domestic tourism)?
    China has the most tourism overall.
  5. How did tourism grow across various countries?
    Tourism generally increases over time.
  6. Are there any correlations between tourism of countries / regions and other variables (heritage, interest over time, etc.)?
    Heritage and cultural influence show a moderate correlation with inbound tourism, interest over time appears unrelated to inbound tourism, and land area shows little correlation.

Dataset

Numbered list of dataset (with downloadable links) and a brief but clear description of each dataset used. Draw reference to the numbering when describing methodology (data cleaning and analysis).
  1. https://www.unwto.org/tourism-statistics/key-tourism-statistics Download Link This data provides various key tourism statistics.
  2. https://www.usnews.com/media/best-countries/overall-rankings-2019.pdf?int=top_nav_Download_2019_Rankings The real ranking of countries in various aspects.
  3. https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes Download Link Region of all countries.
  4. https://trends.google.com/trends/?geo=SG pytrends to import data such as interest over time from Google Trends.
  5. https://developers.google.com/public-data/docs/canonical/countries_csv Latitude and longitude for all countries, similar to that from Lab 9.
  6. https://data.worldbank.org/indicator/AG.LND.TOTL.K2 Download Link Land area of all countries.
  7. https://github.com/simonepri/geo-maps Download Link High quality GeoJSON map of the world map.

Methodology

You should demonstrate the data science life cycle here (from data acquisition to cleaning to EDA and analysis etc).

First of all, I will be including all the necessary imports here.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import csv
import folium as folium
import pycountry
import plotly.express as px #pip install plotly==5.10.0 OR conda install -c plotly plotly=5.10.0
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Data Acquisition

Display the data which will be used in the project. The data should be saved in .xlsx or .csv format to be submitted with the project. If webscraping has been done to obtain your data, save your webscraping code in another jupyter notebook as appendix to be submitted separately from the report. Import and display each dataset in a dataframe. For each dataset, give a brief overview of the data it contains, and explain the meaning of columns that are relevant to the project.

I read in each dataset below; the numbered comments refer to the dataset list above.

In [2]:
#1
data = pd.read_excel('unwto-all-data-download_0.xlsx', sheet_name=None, header = None)
#2 PDF loaded in Appendix
ranking_19 = pd.read_csv('overall_rankings_2019.csv')
#3 JSON loaded in Appendix
region = pd.read_csv('region.csv')
#4 Retrieved in Appendix
countries_interest = pd.read_csv('countries_interest.csv')
#5 Web scraped in Appendix
countries = pd.read_csv('countries.csv', index_col=0)
#6
area = pd.read_csv('API_AG.LND.TOTL.K2_DS2_en_csv_v2_4546125.csv', names = range(67))

Data Cleaning

For data cleaning, be clear in which dataset (or variables) are used, what has been done for missing data, how was merging performed, explanation of data transformation (if any). If data is calculated or summarized from the raw dataset, explain the rationale and steps clearly.

I will be analysing 3 main datasets: outbound departures, inbound arrivals, and domestic trips.
I will first roughly clean each dataset to keep only the data I may need. After that, I will further clean the data by selecting and joining columns to match what I want to analyse.
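As a minimal illustration of one cleaning step used below (toy values, not taken from the dataset): the UNWTO sheets mark missing values with the sentinel '..', which forces year columns to load as strings, so replacing it with NaN lets the column be converted to numeric.

```python
import numpy as np
import pandas as pd

# A toy year column as it loads from the sheet: strings with a '..' sentinel.
col = pd.Series(['304.0', '..', '119.0'])
# Swap the sentinel for NaN, then the column can be coerced to float.
col = col.replace('..', np.nan).astype(float)
```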

Firstly, I will be roughly cleaning outbound departures.

In [3]:
#Getting dataframe for outbound departures
outbound_departures = data['Outbound Tourism-Departures']
#Dropping redundant data
outbound_departures = outbound_departures.drop([0,1])
outbound_departures = outbound_departures.drop(outbound_departures.tail(4).index)
outbound_departures = outbound_departures.iloc[: , :-1]
outbound_departures = outbound_departures.drop([0,1,2,4,7,9], axis = 1)
outbound_departures = outbound_departures.reset_index(drop=True)
outbound_departures.columns = outbound_departures.iloc[0]
outbound_departures = outbound_departures[1:]
outbound_departures = outbound_departures[outbound_departures.iloc[:,4]!='Departures']
outbound_departures.iloc[:,0] = outbound_departures.iloc[:,0].ffill()
outbound_departures = outbound_departures[outbound_departures.Units.notna()]
#Replacing '..' with NaN
outbound_departures = outbound_departures.replace('..', np.nan)
outbound_departures.iloc[:,1:3] = outbound_departures.iloc[:,1:3].ffill(axis=1)
#Renaming the columns
new_columns = list(outbound_departures.columns.values)
new_columns[0] = 'Countries'
new_columns[1] = 'placeholder'
new_columns[2] = 'Indicators'
outbound_departures.columns = new_columns
outbound_departures = outbound_departures.drop('placeholder', axis = 1)
outbound_departures = outbound_departures.reset_index(drop=True)
#Setting the index
outbound_departures.index = [np.array(outbound_departures['Countries']), np.array(outbound_departures['Indicators'])]
outbound_departures = outbound_departures.drop(['Countries', 'Indicators'], axis = 1)
#Renaming the index
outbound_departures.index.names = ['Countries', 'Indicators']
outbound_departures
Out[3]:
Units 1995 1996 1997 1998 1999 2000 2001 2002 2003 ... 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
Countries Indicators
AFGHANISTAN Total departures Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Overnights visitors (tourists) Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Same-day visitors (excursionists) Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ALBANIA Total departures Thousands NaN NaN NaN NaN NaN NaN 955.0 1303.0 1350.0 ... 4120.0 3959.0 3928.0 4146.0 4504.0 4852.0 5186.0 5415.0 5922.0 2907.0
Overnights visitors (tourists) Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZAMBIA Overnights visitors (tourists) Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Same-day visitors (excursionists) Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ZIMBABWE Total departures Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Overnights visitors (tourists) Thousands 256.0 69.0 123.0 213.0 331.0 NaN NaN NaN 386.0 ... 693.0 720.0 2946.0 3182.0 3393.0 3192.0 2768.0 2288.0 3275.0 NaN
Same-day visitors (excursionists) Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

669 rows × 27 columns
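One idiom in the cell above is promoting the first data row to the column header; a minimal sketch on a toy frame:

```python
import pandas as pd

# Toy frame whose real header sits in row 0, as in the raw UNWTO sheets.
df = pd.DataFrame([['Units', 1995, 1996], ['Thousands', 1.0, 2.0]])
df.columns = df.iloc[0]             # first row becomes the header
df = df[1:].reset_index(drop=True)  # drop that row from the data
```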

Secondly, I will be roughly cleaning the inbound arrivals.

In [4]:
#Getting dataframe for inbound arrivals
inbound_arrivals = data[' Inbound Tourism-Arrivals']
#Dropping redundant data
inbound_arrivals = inbound_arrivals.drop([0,1])
inbound_arrivals = inbound_arrivals.drop(inbound_arrivals.tail(8).index)
inbound_arrivals = inbound_arrivals.iloc[: , :-1]
inbound_arrivals = inbound_arrivals.drop([0,1,2,4,9,10], axis = 1)
inbound_arrivals = inbound_arrivals.reset_index(drop=True)
inbound_arrivals.columns = inbound_arrivals.iloc[0]
inbound_arrivals = inbound_arrivals[1:]
inbound_arrivals = inbound_arrivals[inbound_arrivals.iloc[:,4]!='Arrivals']
inbound_arrivals.iloc[:,0] = inbound_arrivals.iloc[:,0].ffill()
inbound_arrivals = inbound_arrivals[inbound_arrivals.Units.notna()]
#Replacing '..' with NaN
inbound_arrivals = inbound_arrivals.replace('..', np.nan)
inbound_arrivals.iloc[:,1:4] = inbound_arrivals.iloc[:,1:4].ffill(axis=1)
#Renaming the columns
new_columns = list(inbound_arrivals.columns.values)
new_columns[0] = 'Countries'
new_columns[1] = 'placeholder'
new_columns[2] = 'placeholder2'
new_columns[3] = 'Indicators'
inbound_arrivals.columns = new_columns
inbound_arrivals = inbound_arrivals.drop(['placeholder', 'placeholder2'], axis = 1)
inbound_arrivals = inbound_arrivals.reset_index(drop=True)
#Setting the index
inbound_arrivals.index = [np.array(inbound_arrivals['Countries']), np.array(inbound_arrivals['Indicators'])]
#Removing Indicators that are 'of which, cruise passengers' as it is a subset of 'Same-day visitors (excursionists)'
inbound_arrivals = inbound_arrivals[inbound_arrivals.Indicators != 'of which, cruise passengers']
inbound_arrivals = inbound_arrivals.drop(['Countries', 'Indicators'], axis = 1)
#Renaming the index
inbound_arrivals.index.names = ['Countries', 'Indicators']
inbound_arrivals
Out[4]:
Units 1995 1996 1997 1998 1999 2000 2001 2002 2003 ... 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
Countries Indicators
AFGHANISTAN Total arrivals Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Overnights visitors (tourists) Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Same-day visitors (excursionists) Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ALBANIA Total arrivals Thousands 304.0 287.0 119.0 184.0 371.0 317.0 354.0 470.0 557.0 ... 2932.0 3514.0 3256.0 3673.0 4131.0 4736.0 5118.0 5927.0 6406.0 2658.0
Overnights visitors (tourists) Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 2469.0 3156.0 2857.0 3341.0 3784.0 4070.0 4643.0 5340.0 6128.0 2604.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZAMBIA Overnights visitors (tourists) Thousands 163.0 264.0 341.0 362.0 404.0 457.0 492.0 565.0 413.0 ... 920.0 859.0 915.0 947.0 932.0 956.0 1009.0 1072.0 1266.0 502.0
Same-day visitors (excursionists) Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ZIMBABWE Total arrivals Thousands 1416.0 1597.0 1336.0 2090.0 2250.0 1967.0 2217.0 2041.0 2256.0 ... 2423.0 1794.0 1833.0 1880.0 2057.0 2168.0 2423.0 2580.0 2294.0 639.0
Overnights visitors (tourists) Thousands 1363.0 1577.0 1281.0 1986.0 2101.0 1868.0 2068.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Same-day visitors (excursionists) Thousands 53.0 20.0 55.0 104.0 149.0 99.0 149.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

669 rows × 27 columns

Similar to outbound departures and inbound arrivals, I roughly cleaned up domestic trips.

In [5]:
#Getting dataframe for domestic trips
domestic_trips = data['Domestic Tourism-Trips']
#Dropping redundant data
domestic_trips = domestic_trips.drop([0,1])
domestic_trips = domestic_trips.drop(domestic_trips.tail(4).index)
domestic_trips = domestic_trips.iloc[: , :-1]
domestic_trips = domestic_trips.drop([0,1,2,4,9], axis = 1)
domestic_trips = domestic_trips.reset_index(drop=True)
domestic_trips.columns = domestic_trips.iloc[0]
domestic_trips = domestic_trips[1:]
domestic_trips = domestic_trips[domestic_trips.iloc[:,4]!='Trips']
domestic_trips.iloc[:,0] = domestic_trips.iloc[:,0].ffill()
domestic_trips = domestic_trips[domestic_trips.Units.notna()]
#Replacing '..' with NaN
domestic_trips = domestic_trips.replace('..', np.nan)
domestic_trips.iloc[:,1:4] = domestic_trips.iloc[:,1:4].ffill(axis=1)
#Renaming the columns
new_columns = list(domestic_trips.columns.values)
new_columns[0] = 'Countries'
new_columns[1] = 'placeholder'
new_columns[2] = 'placeholder2'
new_columns[3] = 'Indicators'
domestic_trips.columns = new_columns
domestic_trips = domestic_trips.drop(['placeholder', 'placeholder2'], axis = 1)
domestic_trips = domestic_trips.reset_index(drop=True)
#Setting the index
domestic_trips.index = [np.array(domestic_trips['Countries']), np.array(domestic_trips['Indicators'])]
domestic_trips = domestic_trips.drop(['Countries', 'Indicators'], axis = 1)
#Renaming the index
domestic_trips.index.names = ['Countries', 'Indicators']
domestic_trips
Out[5]:
Units 1995 1996 1997 1998 1999 2000 2001 2002 2003 ... 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
Countries Indicators
AFGHANISTAN Total trips Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Overnights visitors (tourists) Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Same-day visitors (excursionists) Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ALBANIA Total trips Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Overnights visitors (tourists) Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZAMBIA Overnights visitors (tourists) Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Same-day visitors (excursionists) Thousands NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ZIMBABWE Total trips Thousands NaN NaN NaN NaN 35233.0 32468.0 29178.0 22109.0 19894.0 ... 15878.0 16327.0 13431.0 13781.0 16427.0 16377.0 15707.0 15180.0 20991.0 NaN
Overnights visitors (tourists) Thousands NaN NaN NaN NaN 20427.0 18824.0 16917.0 12818.0 11534.0 ... 9206.0 9466.0 7787.0 7990.0 9524.0 9495.0 9106.0 8801.0 12157.0 NaN
Same-day visitors (excursionists) Thousands NaN NaN NaN NaN 14806.0 13644.0 12261.0 9291.0 8360.0 ... 6672.0 6861.0 5644.0 5791.0 6903.0 6882.0 6600.0 6379.0 8834.0 NaN

669 rows × 27 columns

Here I have the rankings of countries in different areas in 2019, along with an overall ranking. I use the 2019 rankings, the latest available before the pandemic, as I am analysing tourism before COVID-19.
I am only dropping the irrelevant column.

In [6]:
#Dropping redundant data
ranking_19 = ranking_19.drop('Unnamed: 0', axis=1)
ranking_19
Out[6]:
Overall Rank Country Entrepreneurship Adventure Citizenship Cultural Influence Heritage Movers Open for Business Power Quality of Life
0 1 Switzerland 4 17 3 9 27 25 2 14 5
1 2 Japan 1 39 17 6 10 5 22 7 13
2 3 Canada 6 19 2 12 42 39 7 12 1
3 4 Germany 2 57 12 11 20 34 21 4 10
4 5 United Kingdom 5 40 11 5 12 53 23 5 12
... ... ... ... ... ... ... ... ... ... ... ...
70 71 Jordan 62 71 74 70 48 50 63 33 77
71 72 Tunisia 69 60 76 65 53 63 55 63 68
72 73 Belarus 56 61 50 71 67 66 76 35 67
73 74 Nigeria 67 74 77 63 76 57 58 46 74
74 75 Pakistan 68 77 78 79 71 56 72 22 73

75 rows × 11 columns

Similarly, I am also dropping the irrelevant column for region. Region will be used to group data together based on location.

In [7]:
#Dropping redundant data
region = region.drop('Unnamed: 0', axis=1)
region
Out[7]:
name alpha-2 alpha-3 country-code iso_3166-2 region sub-region intermediate-region region-code sub-region-code intermediate-region-code
0 Afghanistan AF AFG 4 ISO 3166-2:AF Asia Southern Asia NaN 142.0 34.0 NaN
1 Åland Islands AX ALA 248 ISO 3166-2:AX Europe Northern Europe NaN 150.0 154.0 NaN
2 Albania AL ALB 8 ISO 3166-2:AL Europe Southern Europe NaN 150.0 39.0 NaN
3 Algeria DZ DZA 12 ISO 3166-2:DZ Africa Northern Africa NaN 2.0 15.0 NaN
4 American Samoa AS ASM 16 ISO 3166-2:AS Oceania Polynesia NaN 9.0 61.0 NaN
... ... ... ... ... ... ... ... ... ... ... ...
244 Wallis and Futuna WF WLF 876 ISO 3166-2:WF Oceania Polynesia NaN 9.0 61.0 NaN
245 Western Sahara EH ESH 732 ISO 3166-2:EH Africa Northern Africa NaN 2.0 15.0 NaN
246 Yemen YE YEM 887 ISO 3166-2:YE Asia Western Asia NaN 142.0 145.0 NaN
247 Zambia ZM ZMB 894 ISO 3166-2:ZM Africa Sub-Saharan Africa Eastern Africa 2.0 202.0 14.0
248 Zimbabwe ZW ZWE 716 ISO 3166-2:ZW Africa Sub-Saharan Africa Eastern Africa 2.0 202.0 14.0

249 rows × 11 columns

Here we have the countries and their respective latitudes and longitudes. I renamed some countries to match the geodata. (I later found a better geodata source, which makes the renaming unnecessary, but I kept it anyway.)

In [8]:
countries = countries.reset_index(drop = True)
#Renaming countries' names to match geodata
countries = countries.replace({'name': {'United States': 'United States of America', 'Bahamas': 'The Bahamas', 'Serbia': 'Republic of Serbia', 'Macedonia [FYROM]': 'Macedonia', 'Myanmar [Burma]': 'Myanmar', 'Guinea-Bissau': 'Guinea Bissau', 'Congo [Republic]': 'Republic of the Congo', 'Tanzania': 'United Republic of Tanzania', 'Timor-Leste': 'East Timor'}})
countries
Out[8]:
country latitude longitude name
0 AD 42.546245 1.601554 Andorra
1 AE 23.424076 53.847818 United Arab Emirates
2 AF 33.939110 67.709953 Afghanistan
3 AG 17.060816 -61.796428 Antigua and Barbuda
4 AI 18.220554 -63.068615 Anguilla
... ... ... ... ...
240 YE 15.552727 48.516388 Yemen
241 YT -12.827500 45.166244 Mayotte
242 ZA -30.559482 22.937506 South Africa
243 ZM -13.133897 27.849332 Zambia
244 ZW -19.015438 29.154857 Zimbabwe

245 rows × 4 columns

I cleaned the data from Google Trends by making Date the index and also changing its dtype to datetime.

In [9]:
#Renaming 'date' into 'Date'
countries_interest = countries_interest.rename(columns={'date':'Date'})
#Setting Date as the index
countries_interest.set_index('Date', inplace=True)
#Changing the data type of date into datetime
countries_interest.index = pd.to_datetime(countries_interest.index)
countries_interest
Out[9]:
Afghanistan Åland Islands Albania American Samoa Andorra Angola Anguilla Antarctica Antigua and Barbuda Argentina ... Turks and Caicos Islands Tuvalu Uganda Ukraine United Arab Emirates United Kingdom of Great Britain and Northern Ireland United States of America United States Minor Outlying Islands Viet Nam Zimbabwe
Date
2004-01-01 7 80 17 34 95 23 100 74 0 30 ... 43 60 41 2 69 0 48 0 9 29
2004-02-01 7 0 19 29 83 27 68 86 44 30 ... 100 100 42 2 73 0 53 0 11 28
2004-03-01 8 0 18 25 72 25 70 100 32 34 ... 78 63 40 2 77 0 42 100 12 29
2004-04-01 8 0 20 28 49 27 61 81 24 34 ... 28 58 43 2 75 0 45 80 13 31
2004-05-01 8 100 20 27 49 28 68 100 26 34 ... 5 54 43 2 71 0 45 0 11 30
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2022-03-01 9 24 37 16 34 30 54 62 37 43 ... 18 39 91 89 34 55 15 11 20 26
2022-04-01 6 35 30 19 32 31 47 54 34 37 ... 18 42 92 33 28 60 16 15 17 25
2022-05-01 6 37 39 18 29 33 40 52 33 37 ... 15 39 89 22 28 61 16 13 20 29
2022-06-01 10 24 43 15 37 32 44 49 38 58 ... 18 33 98 16 30 41 13 18 20 39
2022-07-01 5 25 45 17 37 35 44 49 32 39 ... 17 38 100 13 27 38 12 16 19 46

223 rows × 227 columns
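A side benefit of the DatetimeIndex is label-based time slicing, which makes it easy to restrict to pre-pandemic rows; a small sketch with toy values:

```python
import pandas as pd

# With a DatetimeIndex, slices select by date label rather than position.
s = pd.Series([1, 2, 3],
              index=pd.to_datetime(['2019-12-01', '2020-01-01', '2020-02-01']))
pre_covid = s[:'2019-12-31']  # keeps only the 2019 entry
```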

I will now clean the data on the land area of all countries. I first remove all unnecessary rows and columns, followed by converting the object data to float.

In [10]:
#Dropping redundant data
area = area.loc[2:, :63]
area = area.T.set_index(2).T.reset_index(drop = True)
area.columns.name = ''
area = area.drop([1960.0, 'Indicator Name', 'Indicator Code'], axis = 1)
#Changing the data type to float
area[np.arange(1961,2020)] = area[np.arange(1961,2020)].astype('float')
area
Out[10]:
Country Name Country Code 1961.0 1962.0 1963.0 1964.0 1965.0 1966.0 1967.0 1968.0 ... 2010.0 2011.0 2012.0 2013.0 2014.0 2015.0 2016.0 2017.0 2018.0 2019.0
0 Aruba ABW 180.0 180.0 180.0 180.0 180.0 180.0 180.0 180.0 ... 1.800000e+02 180.00 1.800000e+02 1.800000e+02 1.800000e+02 1.800000e+02 1.800000e+02 1.800000e+02 1.800000e+02 1.800000e+02
1 Africa Eastern and Southern AFE 14571611.0 14571611.0 14571611.0 14571611.0 14571611.0 14571611.0 14571611.0 14571611.0 ... 1.472096e+07 14721240.05 1.484517e+07 1.484513e+07 1.484509e+07 1.484514e+07 1.484515e+07 1.484514e+07 1.484515e+07 1.484516e+07
2 Afghanistan AFG 652230.0 652230.0 652230.0 652230.0 652230.0 652230.0 652230.0 652230.0 ... 6.522300e+05 652230.00 6.522300e+05 6.522300e+05 6.522300e+05 6.522300e+05 6.522300e+05 6.522300e+05 6.522300e+05 6.522300e+05
3 Africa Western and Central AFW 9046580.0 9046580.0 9046580.0 9046580.0 9046580.0 9046580.0 9046580.0 9046580.0 ... 9.045780e+06 9045780.00 9.045780e+06 9.045780e+06 9.045780e+06 9.045780e+06 9.045780e+06 9.045780e+06 9.045780e+06 9.045780e+06
4 Angola AGO 1246700.0 1246700.0 1246700.0 1246700.0 1246700.0 1246700.0 1246700.0 1246700.0 ... 1.246700e+06 1246700.00 1.246700e+06 1.246700e+06 1.246700e+06 1.246700e+06 1.246700e+06 1.246700e+06 1.246700e+06 1.246700e+06
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 Kosovo XKX NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
262 Yemen, Rep. YEM 527970.0 527970.0 527970.0 527970.0 527970.0 527970.0 527970.0 527970.0 ... 5.279700e+05 527970.00 5.279700e+05 5.279700e+05 5.279700e+05 5.279700e+05 5.279700e+05 5.279700e+05 5.279700e+05 5.279700e+05
263 South Africa ZAF 1213090.0 1213090.0 1213090.0 1213090.0 1213090.0 1213090.0 1213090.0 1213090.0 ... 1.213090e+06 1213090.00 1.213090e+06 1.213090e+06 1.213090e+06 1.213090e+06 1.213090e+06 1.213090e+06 1.213090e+06 1.213090e+06
264 Zambia ZMB 743390.0 743390.0 743390.0 743390.0 743390.0 743390.0 743390.0 743390.0 ... 7.433900e+05 743390.00 7.433900e+05 7.433900e+05 7.433900e+05 7.433900e+05 7.433900e+05 7.433900e+05 7.433900e+05 7.433900e+05
265 Zimbabwe ZWE 386850.0 386850.0 386850.0 386850.0 386850.0 386850.0 386850.0 386850.0 ... 3.868500e+05 386850.00 3.868500e+05 3.868500e+05 3.868500e+05 3.868500e+05 3.868500e+05 3.868500e+05 3.868500e+05 3.868500e+05

266 rows × 61 columns

Now, I will further clean the data to match what I want to analyse by selecting and merging datasets. First, I defined two functions that return the alpha 2 and alpha 3 codes through fuzzy searches of country names. I used search_fuzzy because some countries go by several names. The alpha 2 and alpha 3 codes are standardised, unlike country names, which makes them very useful for merging with other data and for plotting.

In [11]:
#Function to find alpha 2 code of country through its name
def findCountry2(country_name):
    try:
        return pycountry.countries.search_fuzzy(country_name)[0].alpha_2
    except LookupError:
        return None
In [12]:
#Function to find alpha 3 code of country through its name
def findCountry3(country_name):
    try:
        return pycountry.countries.search_fuzzy(country_name)[0].alpha_3
    except LookupError:
        return None
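As a design note: each row currently triggers two separate fuzzy searches, one per helper. A sketch of an alternative that returns both codes from one cached call, using a small hypothetical lookup table in place of pycountry so the sketch stays self-contained:

```python
from functools import lru_cache

# Hypothetical stand-in for pycountry.countries.search_fuzzy: a plain dict
# with a couple of entries; the real fuzzy search is far slower per call.
_KNOWN = {
    'Germany': ('DE', 'DEU'),
    'United States of America': ('US', 'USA'),
}

@lru_cache(maxsize=None)
def find_codes(country_name):
    # One cached call yields both codes, so each country name is resolved
    # once instead of once per helper function.
    return _KNOWN.get(country_name, (None, None))
```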

For the datasets below, I will add up 'Overnights visitors (tourists)' and 'Same-day visitors (excursionists)' to obtain 'Total departures' where it is missing. I will use millions as the unit, as it is easier to display on the folium choropleth map legend. To replace missing values, I will forward fill and then backward fill each country's series; this patches gaps within a country's data without affecting countries that have no data at all. I chose to analyse 1995 to 2019 only, since the drastic drop in 2020 cannot reasonably be filled. I will also add a sum totalling each country's data over the years, attach the countries' alpha 2 and alpha 3 codes, and merge in the region, latitude, and longitude.
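The fill strategy can be illustrated on a toy frame (values are made up): gaps inside a country's series are filled from neighbouring years, while a country with no data at all stays untouched.

```python
import numpy as np
import pandas as pd

# Row A has gaps inside its series; row B has no data at all.
df = pd.DataFrame(
    {1995: [np.nan, np.nan], 1996: [2.0, np.nan],
     1997: [np.nan, np.nan], 1998: [4.0, np.nan]},
    index=['A', 'B'],
)
# Forward fill, then backward fill, along the year axis.
filled = df.ffill(axis=1).bfill(axis=1)
# Row A becomes 2, 2, 2, 4; row B remains all NaN, so countries with no
# data are not given invented values.
```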

In [13]:
#Adding up 'Overnights visitors (tourists)' and 'Same-day visitors (excursionists)' to get 'Total departures'
outbound_2 = outbound_departures.reset_index()
outbound_2 = outbound_2[outbound_2['Indicators'] != 'Total departures']
outbound_2 = outbound_2.groupby('Countries').sum()[np.arange(1995,2020)].reset_index()
outbound_2 = outbound_2.set_index('Countries')
outbound_2

outbound = outbound_departures.reset_index()
#Selecting 'Total departures'
outbound = outbound[outbound['Indicators'] == 'Total departures']
outbound = outbound.drop(['Indicators', 'Units', 2020], axis = 1)
outbound = outbound.set_index('Countries')
#Replacing NaN data of outbound with values from outbound_2
outbound = outbound.combine_first(outbound_2).reset_index()
#Changing the units from thousands to millions
outbound[outbound.select_dtypes(include = ['number']).columns] /= 1000.0
outbound.insert(0, 'Units', 'Millions')
#Setting all 0 to NaN
outbound = outbound.replace({0: np.nan})
#Forward filling before backward filling
outbound[np.arange(1995,2020)] = outbound[np.arange(1995,2020)].ffill(axis = 1).bfill(axis = 1)
#Summing the data
outbound['sum'] = outbound[np.arange(1995,2020)].sum(axis=1)
#Getting alpha 2 code
outbound['country_alpha_2'] = outbound.apply(lambda row: findCountry2(row.Countries) , axis = 1)
#Getting alpha 3 code
outbound['country_alpha_3'] = outbound.apply(lambda row: findCountry3(row.Countries) , axis = 1)
#Merging with countries to get latitude and longitude of each country
outbound = pd.merge(outbound, countries, left_on='country_alpha_2', right_on='country', how='left')
#Merging with region to get region of each country
outbound = pd.merge(outbound, region[['alpha-2', 'region']], left_on='country_alpha_2', right_on='alpha-2', how='left')
#Dropping redundant data
outbound = outbound.drop(['alpha-2', 'country', 'Countries'],axis=1)
#Rearranging the columns
cols = outbound.columns.tolist()
cols = [cols[-2]] + cols[-6:-4] + [cols[-1]] + cols[-4:-2] + cols[:-6]
outbound = outbound[cols]
#Dropping null data
outbound = outbound[outbound['name'].notna()]
#Renaming the column
outbound = outbound.rename(columns={'Units':'units'})
#Sorting through sum
outbound = outbound.sort_values(by='sum', ascending = False)
#Dropping duplicate data
outbound = outbound.drop_duplicates(subset = 'name')
#Resetting index
outbound =  outbound.reset_index(drop = True)
#Setting the alpha 2 code and alpha 3 code specifically for Namibia
outbound.loc[outbound['name'] == 'Namibia', 'country_alpha_2'] = 'NA'
outbound.loc[outbound['name'] == 'Namibia', 'country_alpha_3'] = 'NAM'
outbound
Out[13]:
name country_alpha_2 country_alpha_3 region latitude longitude units 1995 1996 1997 ... 2011 2012 2013 2014 2015 2016 2017 2018 2019 sum
0 United States of America US USA Americas 37.090240 -95.712891 Millions 74.031 76.803 78.481 ... 114.089 116.329 118.968 121.699 130.364 141.526 148.045 158.4454 170.9301 2781.9205
1 Mexico MX MEX Americas 23.634501 -102.552784 Millions 103.161 103.442 107.242 ... 88.113 87.332 90.787 90.982 94.988 97.372 94.274 86.2800 82.7520 2636.9450
2 Germany DE DEU Europe 51.165691 10.451526 Millions 55.800 55.800 55.800 ... 84.692 82.729 87.459 83.008 83.737 90.966 92.402 108.5420 99.5330 2048.5650
3 Namibia NA NAM Africa -22.957640 18.490410 Millions 47.594 47.594 47.594 ... 84.816 85.276 84.414 84.519 89.082 91.758 91.304 92.2140 94.7150 1833.0990
4 United Kingdom GB GBR Europe 55.378051 -3.435973 Millions 41.345 42.050 45.957 ... 67.493 66.858 68.959 72.204 77.619 81.757 87.242 90.5710 93.0860 1642.6580
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
200 Jamaica JM JAM Americas 18.109581 -77.297508 Millions NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0000
201 Iraq IQ IRQ Asia 33.223191 43.679291 Millions NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0000
202 Haiti HT HTI Americas 18.971187 -72.285215 Millions NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0000
203 Guyana GY GUY Americas 4.860416 -58.930180 Millions NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0000
204 Lebanon LB LBN Asia 33.854721 35.862285 Millions NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0000

205 rows × 33 columns
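The combine_first step above patches NaN values in the reported totals with the computed sums, aligned on the index; a minimal sketch with toy values:

```python
import numpy as np
import pandas as pd

# Reported totals with a gap, and totals computed from the sub-categories.
reported = pd.Series([np.nan, 10.0], index=['A', 'B'])
computed = pd.Series([7.0, 99.0], index=['A', 'B'])
# NaN in the caller is filled from the argument; existing values win.
patched = reported.combine_first(computed)
# A takes the computed 7.0; B keeps its reported 10.0.
```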

In [14]:
#Adding up 'Overnights visitors (tourists)' and 'Same-day visitors (excursionists)' to get 'Total arrivals'
inbound_2 = inbound_arrivals.reset_index()
inbound_2 = inbound_2[inbound_2['Indicators'] != 'Total arrivals']
inbound_2 = inbound_2.groupby('Countries').sum()[np.arange(1995,2020)].reset_index()
inbound_2 = inbound_2.set_index('Countries')
inbound_2

inbound = inbound_arrivals.reset_index()
#Selecting 'Total arrivals'
inbound = inbound[inbound['Indicators'] == 'Total arrivals']
inbound = inbound.drop(['Indicators', 'Units', 2020], axis = 1)
inbound = inbound.set_index('Countries')
#Replacing NaN data of inbound with values from inbound_2
inbound = inbound.combine_first(inbound_2).reset_index()
#Changing the units from thousands to millions
inbound[inbound.select_dtypes(include = ['number']).columns] /= 1000.0
inbound.insert(0, 'Units', 'Millions')
#Setting all 0 to NaN
inbound = inbound.replace({0: np.nan})
#Forward filling before backward filling
inbound[np.arange(1995,2020)] = inbound[np.arange(1995,2020)].ffill(axis = 1).bfill(axis = 1)
#Summing the data
inbound['sum'] = inbound[np.arange(1995,2020)].sum(axis=1)
#Getting alpha 2 code
inbound['country_alpha_2'] = inbound.apply(lambda row: findCountry2(row.Countries) , axis = 1)
#Getting alpha 3 code
inbound['country_alpha_3'] = inbound.apply(lambda row: findCountry3(row.Countries) , axis = 1)
#Merging with countries to get latitude and longitude of each country
inbound = pd.merge(inbound, countries, left_on='country_alpha_2', right_on='country', how='left')
#Merging with region to get region of each country
inbound = pd.merge(inbound, region[['alpha-2', 'region']], left_on='country_alpha_2', right_on='alpha-2', how='left')
#Dropping redundant data
inbound = inbound.drop(['alpha-2', 'country', 'Countries'],axis=1)
#Rearranging the columns
cols = inbound.columns.tolist()
cols = [cols[-2]] + cols[-6:-4] + [cols[-1]] + cols[-4:-2] + cols[:-6]
inbound = inbound[cols]
#Dropping null data
inbound = inbound[inbound['name'].notna()]
#Renaming the column
inbound = inbound.rename(columns={'Units':'units'})
#Sorting through sum
inbound = inbound.sort_values(by='sum', ascending = False)
#Dropping duplicate data
inbound = inbound.drop_duplicates(subset = 'name')
#Resetting index
inbound =  inbound.reset_index(drop = True)
#Setting the alpha 2 code and alpha 3 code specifically for Namibia
inbound.loc[inbound['name'] == 'Namibia', 'country_alpha_2'] = 'NA'
inbound.loc[inbound['name'] == 'Namibia', 'country_alpha_3'] = 'NAM'
inbound
Out[14]:
name country_alpha_2 country_alpha_3 region latitude longitude units 1995 1996 1997 ... 2011 2012 2013 2014 2015 2016 2017 2018 2019 sum
0 France FR FRA Europe 46.227638 2.213749 Millions 60.033 148.263 157.551 ... 196.595000 197.522000 204.410000 206.599000 203.302000 203.042000 207.274000 211.998000 217.877 4001.252000
1 United States of America US USA Americas 37.090240 -95.712891 Millions 79.732 82.756 82.525 ... 147.271416 171.629897 179.309907 178.311354 176.864526 175.261488 174.291746 169.324918 165.478 3205.206252
2 China CN CHN Asia 35.861660 104.195397 Millions 46.387 51.128 57.588 ... 135.423000 132.405000 129.078000 128.499000 133.820000 141.774000 153.260000 158.606000 162.538 2805.217000
3 Mexico MX MEX Americas 23.634501 -102.552784 Millions 85.446 90.394 92.915 ... 75.732000 76.749000 78.100000 81.042000 87.129000 94.853000 99.349000 96.497000 97.406 2306.193000
4 Spain ES ESP Europe 40.463667 -3.749220 Millions 52.460 55.077 62.415 ... 99.187000 98.128000 103.231000 107.144000 109.834000 115.561000 121.717000 124.456000 126.170 2284.187000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
200 Somalia SO SOM Africa 5.152149 46.199616 Millions NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000
201 Equatorial Guinea GQ GNQ Africa 1.650801 10.267895 Millions NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000
202 Nauru NR NRU Oceania -0.522778 166.931503 Millions NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000
203 Liberia LR LBR Africa 6.428055 -9.429499 Millions NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000
204 Afghanistan AF AFG Asia 33.939110 67.709953 Millions NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000

205 rows × 33 columns
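The `combine_first` call in the cell above keeps reported 'Total arrivals' values and fills the gaps from the summed sub-indicators. A minimal sketch of that patching behaviour on toy data (all names and values here are illustrative):

```python
import numpy as np
import pandas as pd

# Primary table: reported totals, with gaps for country B.
primary = pd.DataFrame({2018: [10.0, np.nan], 2019: [11.0, np.nan]},
                       index=['A', 'B'])
# Fallback table: sums of the two visitor sub-indicators.
fallback = pd.DataFrame({2018: [9.0, 5.0], 2019: [9.5, 6.0]},
                        index=['A', 'B'])

# combine_first keeps primary's values and fills its NaN from fallback.
patched = primary.combine_first(fallback)
```

Note that country A keeps its reported value (10.0) even though the fallback disagrees (9.0); the fallback is only used where the primary is missing.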

In [15]:
#Adding up 'Overnights visitors (tourists)' and 'Same-day visitors (excursionists)' to get 'Total trips'
domestic_2 = domestic_trips.reset_index()
domestic_2 = domestic_2[domestic_2['Indicators'] != 'Total trips']
domestic_2 = domestic_2.groupby('Countries').sum()[np.arange(1995,2020)].reset_index()
domestic_2 = domestic_2.set_index('Countries')
domestic_2

domestic = domestic_trips.reset_index()
#Selecting 'Total trips'
domestic = domestic[domestic['Indicators'] == 'Total trips']
domestic = domestic.drop(['Indicators', 'Units', 2020], axis = 1)
domestic = domestic.set_index('Countries')
#Replacing NaN data of domestic with values from domestic_2
domestic = domestic.combine_first(domestic_2).reset_index()
#Changing the units from thousands to millions
domestic[domestic.select_dtypes(include = ['number']).columns] /= 1000.0
domestic.insert(0, 'Units', 'Millions')
#Setting all 0 to NaN
domestic = domestic.replace({0: np.nan})
#Forward filling before backward filling
domestic[np.arange(1995,2020)] = domestic[np.arange(1995,2020)].ffill(axis = 1).bfill(axis = 1)
#Summing the data
domestic['sum'] = domestic[np.arange(1995,2020)].sum(axis=1)
#Getting alpha 2 code
domestic['country_alpha_2'] = domestic.apply(lambda row: findCountry2(row.Countries) , axis = 1)
#Getting alpha 3 code
domestic['country_alpha_3'] = domestic.apply(lambda row: findCountry3(row.Countries) , axis = 1)
#Merging with countries to get latitude and longitude of each country
domestic = pd.merge(domestic, countries, left_on='country_alpha_2', right_on='country', how='left')
#Merging with region to get region of each country
domestic = pd.merge(domestic, region[['alpha-2', 'region']], left_on='country_alpha_2', right_on='alpha-2', how='left')
#Dropping redundant data
domestic = domestic.drop(['alpha-2', 'country', 'Countries'],axis=1)
#Rearranging the columns
cols = domestic.columns.tolist()
cols = [cols[-2]] + cols[-6:-4] + [cols[-1]] + cols[-4:-2] + cols[:-6]
domestic = domestic[cols]
#Dropping null data
domestic = domestic[domestic['name'].notna()]
#Renaming the column
domestic = domestic.rename(columns={'Units':'units'})
#Sorting through sum
domestic = domestic.sort_values(by='sum', ascending = False)
#Dropping duplicate data
domestic = domestic.drop_duplicates(subset = 'name')
#Resetting index
domestic =  domestic.reset_index(drop = True)
#Setting the alpha 2 code and alpha 3 code specifically for Namibia
domestic.loc[domestic['name'] == 'Namibia', 'country_alpha_2'] = 'NA'
domestic.loc[domestic['name'] == 'Namibia', 'country_alpha_3'] = 'NAM'
domestic
Out[15]:
name country_alpha_2 country_alpha_3 region latitude longitude units 1995 1996 1997 ... 2011 2012 2013 2014 2015 2016 2017 2018 2019 sum
0 China CN CHN Asia 35.861660 104.195397 Millions 629.000 640.000 644.000 ... 2641.000 2957.00 3262.000 3611.000 3990.000 4435.000 5010.000 5539.000 6005.852 55088.852
1 United States of America US USA Americas 37.090240 -95.712891 Millions 2004.500 2004.500 2004.500 ... 1998.500 2030.30 2059.600 2109.300 2178.700 2206.500 2248.700 2291.100 2326.623 51336.523
2 India IN IND Asia 20.593684 78.962880 Millions 136.644 140.120 159.877 ... 864.533 1045.05 1142.529 1282.802 1431.974 1615.389 1657.546 1853.788 2321.983 18773.016
3 United Kingdom GB GBR Europe 55.378051 -3.435973 Millions 126.010 126.010 126.010 ... 1668.640 1836.02 1710.905 1698.942 1649.626 1953.655 1914.076 1821.956 1776.080 18039.484
4 Japan JP JPN Asia 36.204824 138.252924 Millions 734.558 734.558 734.558 ... 612.525 612.75 630.950 595.221 604.715 641.079 647.510 561.779 587.103 17111.936
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
200 Iraq IQ IRQ Asia 33.223191 43.679291 Millions NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000
201 Jamaica JM JAM Americas 18.109581 -77.297508 Millions NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000
202 Kenya KE KEN Africa -0.023559 37.906193 Millions NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000
203 Kiribati KI KIR Oceania -3.370417 -168.734039 Millions NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000
204 Lebanon LB LBN Asia 33.854721 35.862285 Millions NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000

205 rows × 33 columns

EDA¶

For each research question shortlisted, outline your methodology in answering it. Discuss interesting observations or results discovered. Please note to only show EDA that is relevant to answering the question at hand. If you have done any data modeling, include it in this section.

I have chosen to mainly use folium and plotly to display the data, as these libraries produce interactive plots, which let viewers explore the data themselves and understand it better.

1. Which countries have the highest number of outbound tourism, inbound tourism, and domestic tourism?¶

I will answer this using maps and bar graphs. The process is similar for all three subquestions: I sum each country's data from 1995 to 2019.
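The per-country totals used below are row-wise sums over the year columns; a minimal sketch on toy data (column names and values illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with integer year columns, as in the cleaned datasets.
df = pd.DataFrame({'name': ['A', 'B'],
                   1995: [1.0, 2.0],
                   1996: [3.0, np.nan]})

# Row-wise sum over the selected year columns; NaN counts as 0 by default.
df['sum'] = df[np.arange(1995, 1997)].sum(axis=1)
```

Country A sums to 4.0 and country B to 2.0, since `sum(axis=1)` skips NaN by default.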

i. Outbound tourism (departures)¶

Firstly, I will colour the countries on a world map, with each country's colour representing its value. I chose a red-blue colour scale to highlight the countries with the highest and lowest outbound tourism. A map lets viewers get a sense of each country's location and size, and makes comparison between countries easier.

In [16]:
#Creating map
outbound_map = folium.Map()
#Adding choropleth map
folium.Choropleth(geo_data = 'https://github.com/simonepri/geo-maps/releases/download/v0.6.0/countries-land-10km.geo.json',
                  data = outbound,
                  columns = ['country_alpha_3', 'sum'],
                  key_on = 'feature.properties.A3',
                  fill_color = 'RdBu',
                  fill_opacity = 0.8,
                  line_opacity = 0.2,
                  legend_name = 'Total Number of Outbound Tourism from 1995 to 2019 in Millions'
                 ).add_to(outbound_map)
outbound_map
Out[16]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Next, I added markers on top of the map. Clicking a marker shows the country's name and its total outbound tourism from 1995 to 2019.

In [17]:
#Adding markers
for i in range(0,len(outbound)):
    folium.Marker(
        location = [outbound.iloc[i]['latitude'], outbound.iloc[i]['longitude']],
        popup = outbound.iloc[i]['name'] + '\n' + str(int(outbound.iloc[i]['sum'] * 1000000))
    ).add_to(outbound_map)
outbound_map
Out[17]:
Make this Notebook Trusted to load map: File -> Trust Notebook

I will now plot the bar graphs. First, I consolidated each country's total number of outbound tourism from 1995 to 2019.

In [18]:
outbound_sum = outbound.copy()
#Setting name as index
outbound_sum = outbound_sum.set_index('name')
#Dropping redundant data
outbound_sum = outbound_sum[['sum']]
#Sorting data by sum
outbound_sum = outbound_sum.sort_values(by = 'sum', ascending = False)
#Dropping data <= 0
outbound_sum = outbound_sum[outbound_sum['sum'] > 0].T
#Multiplying by a million
outbound_sum *= 1000000
outbound_sum
Out[18]:
name United States of America Mexico Germany Namibia United Kingdom China Italy Poland Canada Russia ... Central African Republic Vanuatu Cook Islands Nigeria Tajikistan São Tomé and Príncipe Palau Angola Tuvalu Niue
sum 2.781920e+09 2.636945e+09 2.048565e+09 1.833099e+09 1.642658e+09 1.456823e+09 1.196624e+09 1.144544e+09 952199900.0 756519000.0 ... 485000.0 456400.0 279300.0 250000.0 236600.0 235400.0 225000.0 75000.0 60500.0 33200.0

1 rows × 132 columns

Here is a vertical bar graph. The top 15 countries are displayed by default, and viewers can use the range slider to inspect the values for other countries.

In [19]:
#Plotting vertical bar graph and setting the x-axis to show the top 15 countries by default
outbound_vbar = (
    px.bar(outbound_sum.T,
           title = "Countries' Total Number of Outbound Tourism from 1995 to 2019",
           labels = {'value': 'Total Number of Outbound Tourism', 'name': 'Country'})
    .for_each_trace(lambda t: t.update(name = {'sum': 'Total Number of Outbound Tourism'}[t.name]))
    .update_layout(xaxis_rangeslider_visible=True, xaxis_range=[-0.5, 14.5])
    .update_yaxes(fixedrange = False)
)
outbound_vbar

Similarly, here is a horizontal bar graph. I prefer this orientation as it also serves as a ranking.

In [20]:
#Plotting horizontal bar graph, reversing the order, and setting the y-axis to show the top 15 countries by default
outbound_hbar = (
    px.bar(outbound_sum.T.sort_values(by = 'sum', ascending = True),
           title = "Countries' Total Number of Outbound Tourism from 1995 to 2019",
           labels = {'value': 'Total Number of Outbound Tourism', 'name': 'Country'},
           orientation='h')
    .for_each_trace(lambda t: t.update(name = {'sum': 'Total Number of Outbound Tourism'}[t.name]))
    .update_layout(yaxis_range=[len(outbound_sum.columns)-15.5, len(outbound_sum.columns)-0.5])
)
outbound_hbar

ii. Inbound tourism (arrivals)¶

The process is similar to that of outbound tourism.

In [21]:
#Creating map
inbound_map = folium.Map()
#Adding choropleth map
folium.Choropleth(geo_data = 'https://github.com/simonepri/geo-maps/releases/download/v0.6.0/countries-land-10km.geo.json',
                  data = inbound,
                  columns = ['country_alpha_3', 'sum'],
                  key_on = 'feature.properties.A3',
                  fill_color = 'RdBu',
                  fill_opacity = 0.8,
                  line_opacity = 0.2,
                  legend_name = 'Total Number of Inbound Tourism from 1995 to 2019 in Millions'
                 ).add_to(inbound_map)
inbound_map
Out[21]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [22]:
#Adding markers
for i in range(0,len(inbound)):
    folium.Marker(
        location = [inbound.iloc[i]['latitude'], inbound.iloc[i]['longitude']],
        popup = inbound.iloc[i]['name'] + '\n' + str(int(inbound.iloc[i]['sum'] * 1000000))
    ).add_to(inbound_map)
inbound_map
Out[22]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [23]:
inbound_sum = inbound.copy()
#Setting name as index
inbound_sum = inbound_sum.set_index('name')
#Dropping redundant data
inbound_sum = inbound_sum[['sum']]
#Sorting data by sum
inbound_sum = inbound_sum.sort_values(by = 'sum', ascending = False)
#Dropping data <= 0
inbound_sum = inbound_sum[inbound_sum['sum'] > 0].T
#Multiplying by a million
inbound_sum *= 1000000
inbound_sum
Out[23]:
name France United States of America China Mexico Spain Poland Italy Croatia Hungary Canada ... Micronesia Comoros Guinea Bissau Kiribati Solomon Islands São Tomé and Príncipe Montserrat Marshall Islands Niue Tuvalu
sum 4.001252e+09 3.205206e+09 2.805217e+09 2.306193e+09 2.284187e+09 1.804662e+09 1.762915e+09 1.101202e+09 1.021397e+09 876738000.0 ... 630200.0 586900.0 556100.0 555100.0 407400.0 351500.0 303305.0 152400.0 119100.0 38700.0

1 rows × 200 columns

At first, I used matplotlib (below), but ultimately preferred plotly.

In [24]:
#Setting figure size and resolution of figure
plt.figure(figsize=(15,5), dpi=300)
#Plotting vertical bar graph
inbound_sum_plot = sns.barplot(data = inbound_sum.T.head(15).T, color = 'skyblue')
#Rotating the x-axis labels so that they do not overlap
inbound_sum_plot.set_xticklabels(inbound_sum_plot.get_xticklabels(), rotation=45, ha="right")
#Naming x-axis
inbound_sum_plot.set_xlabel('Country')
#Naming y-axis
inbound_sum_plot.set_ylabel('Total Number of Inbound Tourism from 1995 to 2019')
#Naming the figure
inbound_sum_plot.set_title('Top 15 Countries with Largest Total Number of Inbound Tourism from 1995 to 2019')
Out[24]:
Text(0.5, 1.0, 'Top 15 Countries with Largest Total Number of Inbound Tourism from 1995 to 2019')
In [25]:
#Plotting vertical bar graph and setting the x-axis to show the top 15 countries by default
inbound_vbar = (
    px.bar(inbound_sum.T,
           title = "Countries' Total Number of Inbound Tourism from 1995 to 2019",
           labels = {'value': 'Total Number of Inbound Tourism', 'name': 'Country'})
    .for_each_trace(lambda t: t.update(name = {'sum': 'Total Number of Inbound Tourism'}[t.name]))
    .update_layout(xaxis_rangeslider_visible=True, xaxis_range=[-0.5, 14.5])
    .update_yaxes(fixedrange = False)
)
inbound_vbar
In [26]:
#Plotting horizontal bar graph, reversing the order, and setting the y-axis to show the top 15 countries by default
inbound_hbar = (
    px.bar(inbound_sum.T.sort_values(by = 'sum', ascending = True),
           title = "Countries' Total Number of Inbound Tourism from 1995 to 2019",
           labels = {'value': 'Total Number of Inbound Tourism', 'name': 'Country'},
           orientation='h')
    .for_each_trace(lambda t: t.update(name = {'sum': 'Total Number of Inbound Tourism'}[t.name]))
    .update_layout(yaxis_range=[len(inbound_sum.columns)-15.5, len(inbound_sum.columns)-0.5])
)
inbound_hbar

iii. Domestic tourism (trips)¶

Finally, the process for domestic tourism mirrors that for outbound and inbound tourism.

In [27]:
#Creating map
domestic_map = folium.Map()
#Adding choropleth map
folium.Choropleth(geo_data = 'https://github.com/simonepri/geo-maps/releases/download/v0.6.0/countries-land-10km.geo.json',
                  data = domestic,
                  columns = ['country_alpha_3', 'sum'],
                  key_on = 'feature.properties.A3',
                  fill_color = 'RdBu',
                  fill_opacity = 0.8,
                  line_opacity = 0.2,
                  legend_name = 'Total Number of Domestic Tourism from 1995 to 2019 in Millions'
                 ).add_to(domestic_map)
domestic_map
Out[27]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [28]:
#Adding markers
for i in range(0,len(domestic)):
    folium.Marker(
        location = [domestic.iloc[i]['latitude'], domestic.iloc[i]['longitude']],
        popup = domestic.iloc[i]['name'] + '\n' + str(int(domestic.iloc[i]['sum'] * 1000000))
    ).add_to(domestic_map)
domestic_map
Out[28]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [29]:
domestic_sum = domestic.copy()
#Setting name as index
domestic_sum = domestic_sum.set_index('name')
#Dropping redundant data
domestic_sum = domestic_sum[['sum']]
#Sorting data by sum
domestic_sum = domestic_sum.sort_values(by = 'sum', ascending = False)
#Dropping data <= 0
domestic_sum = domestic_sum[domestic_sum['sum'] > 0].T
#Multiplying by a million
domestic_sum *= 1000000
domestic_sum
Out[29]:
name China United States of America India United Kingdom Japan Spain Canada Indonesia France Australia ... Trinidad and Tobago Armenia Senegal Malta Swaziland Luxembourg Tajikistan Madagascar Moldova Mali
sum 5.508885e+10 5.133652e+10 1.877302e+10 1.803948e+10 1.711194e+10 8.054441e+09 6.169413e+09 6.123024e+09 6.109938e+09 6.092891e+09 ... 22442000.0 15368000.0 8685000.0 5268000.0 4909000.0 1892000.0 1666000.0 1133000.0 942700.0 908500.0

1 rows × 84 columns

In [30]:
#Plotting vertical bar graph and setting the x-axis to show the top 15 countries by default
domestic_vbar = (
    px.bar(domestic_sum.T,
           title = "Countries' Total Number of Domestic Tourism from 1995 to 2019",
           labels = {'value': 'Total Number of Domestic Tourism', 'name': 'Country'})
    .for_each_trace(lambda t: t.update(name = {'sum': 'Total Number of Domestic Tourism'}[t.name]))
    .update_layout(xaxis_rangeslider_visible=True, xaxis_range=[-0.5, 14.5])
    .update_yaxes(fixedrange = False)
)
domestic_vbar
In [31]:
#Plotting horizontal bar graph, reversing the order, and setting the y-axis to show the top 15 countries by default
domestic_hbar = (
    px.bar(domestic_sum.T.sort_values(by = 'sum', ascending = True),
           title = "Countries' Total Number of Domestic Tourism from 1995 to 2019",
           labels = {'value': 'Total Number of Domestic Tourism', 'name': 'Country'},
           orientation='h')
    .for_each_trace(lambda t: t.update(name = {'sum': 'Total Number of Domestic Tourism'}[t.name]))
    .update_layout(yaxis_range=[len(domestic_sum.columns)-15.5, len(domestic_sum.columns)-0.5])
)
domestic_hbar

2. What is the country with the most visits (e.g. sum of inbound tourism and domestic tourism)?¶

In my opinion, a country's popularity as a tourism destination is best measured by the number of visits it receives. Hence, I use the sum of inbound tourism and domestic tourism.

First, I will be merging the inbound tourism and domestic tourism.
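The merge-and-sum pattern used in the next cell can be sketched on toy data (all country names and values here are illustrative):

```python
import pandas as pd

# Toy per-country totals.
inbound_toy = pd.DataFrame({'name': ['A', 'B'], 'sum': [5.0, 2.0]})
domestic_toy = pd.DataFrame({'name': ['A', 'B'], 'sum': [7.0, 1.0]})

# Rename each 'sum' before merging so both totals survive the join,
# then add them row-wise for the combined figure.
merged = (inbound_toy.rename(columns={'sum': 'inbound'})
          .merge(domestic_toy.rename(columns={'sum': 'domestic'}), on='name'))
merged['total_arrivals'] = merged[['inbound', 'domestic']].sum(axis=1)
```

Country A ends up with 12.0 total arrivals and country B with 3.0. Renaming before the merge avoids pandas' automatic `_x`/`_y` suffixes on the clashing `sum` columns.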

In [32]:
total_arrivals = inbound.copy()
#Dropping redundant data
total_arrivals = total_arrivals[['name', 'country_alpha_2', 'country_alpha_3', 'region', 'latitude', 'longitude', 'units', 'sum']]
#Renaming column
total_arrivals = total_arrivals.rename(columns = {'sum': 'inbound'})
#Merging
total_arrivals = pd.merge(total_arrivals, domestic[['name', 'sum']], on = 'name')
#Renaming column
total_arrivals = total_arrivals.rename(columns = {'sum': 'domestic'})
#Summing the total number of arrivals
total_arrivals['total_arrivals'] = total_arrivals[['inbound', 'domestic']].sum(axis = 1)
#Sorting the data by total number of arrivals
total_arrivals = total_arrivals.sort_values(by = 'total_arrivals', ascending = False).reset_index(drop = True)
total_arrivals
Out[32]:
name country_alpha_2 country_alpha_3 region latitude longitude units inbound domestic total_arrivals
0 China CN CHN Asia 35.861660 104.195397 Millions 2805.217000 55088.852 57894.069000
1 United States of America US USA Americas 37.090240 -95.712891 Millions 3205.206252 51336.523 54541.729252
2 India IN IND Asia 20.593684 78.962880 Millions 167.871000 18773.016 18940.887000
3 United Kingdom GB GBR Europe 55.378051 -3.435973 Millions 770.019000 18039.484 18809.503000
4 Japan JP JPN Asia 36.204824 138.252924 Millions 266.116000 17111.936 17378.052000
... ... ... ... ... ... ... ... ... ... ...
200 Somalia SO SOM Africa 5.152149 46.199616 Millions 0.000000 0.000 0.000000
201 Equatorial Guinea GQ GNQ Africa 1.650801 10.267895 Millions 0.000000 0.000 0.000000
202 Nauru NR NRU Oceania -0.522778 166.931503 Millions 0.000000 0.000 0.000000
203 Liberia LR LBR Africa 6.428055 -9.429499 Millions 0.000000 0.000 0.000000
204 Afghanistan AF AFG Asia 33.939110 67.709953 Millions 0.000000 0.000 0.000000

205 rows × 10 columns

Similar to the first question, I will be plotting maps and bar graphs.

In [33]:
#Creating map
arrivals_map = folium.Map()
#Adding choropleth map
folium.Choropleth(geo_data = 'https://github.com/simonepri/geo-maps/releases/download/v0.6.0/countries-land-10km.geo.json',
                  data = total_arrivals,
                  columns = ['country_alpha_3', 'total_arrivals'],
                  key_on = 'feature.properties.A3',
                  fill_color = 'RdBu',
                  fill_opacity = 0.8,
                  line_opacity = 0.2,
                  legend_name = 'Total Number of Arrivals from 1995 to 2019 in Millions'
                 ).add_to(arrivals_map)
arrivals_map
Out[33]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [34]:
#Adding markers
for i in range(0,len(total_arrivals)):
    folium.Marker(
        location = [total_arrivals.iloc[i]['latitude'], total_arrivals.iloc[i]['longitude']],
        popup = total_arrivals.iloc[i]['name'] + '\n' + str(int(total_arrivals.iloc[i]['total_arrivals'] * 1000000))
    ).add_to(arrivals_map)
arrivals_map
Out[34]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Next, I will be consolidating the data.

In [35]:
#Dropping redundant data
arrivals_sum = total_arrivals[['name', 'total_arrivals']]
#Setting name as index
arrivals_sum = arrivals_sum.set_index('name')
#Multiplying by a million
arrivals_sum *= 1000000
#Dropping data <= 0
arrivals_sum = arrivals_sum[arrivals_sum['total_arrivals'] > 0]
arrivals_sum
Out[35]:
total_arrivals
name
China 5.789407e+10
United States of America 5.454173e+10
India 1.894089e+10
United Kingdom 1.880950e+10
Japan 1.737805e+10
... ...
São Tomé and Príncipe 3.515000e+05
Montserrat 3.033050e+05
Marshall Islands 1.524000e+05
Niue 1.191000e+05
Tuvalu 3.870000e+04

200 rows × 1 columns

In [36]:
#Plotting vertical bar graph and setting the x-axis to show the top 15 countries by default
px.bar(arrivals_sum,
       title = 'Total Number of Arrivals from 1995 to 2019',
       labels = {'value': 'Total Number of Arrivals', 'name': 'Country'}
      ).for_each_trace(lambda t: t.update(name = {'total_arrivals': 'Total Number of Arrivals'}[t.name])
      ).update_layout(xaxis_range=[-0.5, 14.5])
In [37]:
#Plotting horizontal bar graph, reversing the order, and setting the y-axis to show the top 15 countries by default
arrivals_bar = (
    px.bar(arrivals_sum.sort_values(by = 'total_arrivals', ascending = True),
           title = 'Total Number of Arrivals from 1995 to 2019',
           labels = {'value': 'Total Number of Arrivals', 'name': 'Country'},
           orientation='h')
    .for_each_trace(lambda t: t.update(name = {'total_arrivals': 'Total Number of Arrivals'}[t.name]))
    .update_layout(yaxis_range=[len(arrivals_sum.index)-15.5, len(arrivals_sum.index)-0.5])
)
arrivals_bar

Now, I will plot stacked bar graphs, which show the individual components of each country's total.
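Stacked bars need the data in long format, one row per (country, tourism type). The wide-to-long reshape used in the next cell (set the identifier columns as the index, then `stack`) can be sketched on toy data (values illustrative):

```python
import pandas as pd

# One country in wide form: one column per tourism type.
wide = pd.DataFrame({'name': ['A'], 'inbound': [5.0], 'domestic': [7.0]})

# set_index + stack turns the type columns into rows, giving the long
# format px.bar needs for colour-stacked bars.
long_df = (wide.set_index('name')
               .stack()
               .to_frame()
               .reset_index()
               .rename(columns={'level_1': 'type', 0: 'sum'}))
```

The single wide row becomes two long rows, ('A', 'inbound', 5.0) and ('A', 'domestic', 7.0), ready for `px.bar(..., color='type')`.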

In [38]:
#Dropping redundant data
arrivals_stacked = total_arrivals[['name', 'inbound', 'domestic', 'total_arrivals']]
#Setting name and total_arrivals as the index
arrivals_stacked = arrivals_stacked.set_index(['name', 'total_arrivals'])
#Stack the data and renaming the columns
arrivals_stacked = arrivals_stacked.stack().to_frame().reset_index().rename(columns = {'level_2': 'type', 0: 'sum'})
#Dropping redundant data
arrivals_stacked = arrivals_stacked[['name', 'type', 'sum', 'total_arrivals']]
#Multiplying numerical data by a million
arrivals_stacked[arrivals_stacked.select_dtypes(include = ['number']).columns] *= 1000000
#Dropping data <= 0
arrivals_stacked = arrivals_stacked[arrivals_stacked['total_arrivals'] > 0]
arrivals_stacked
Out[38]:
name type sum total_arrivals
0 China inbound 2.805217e+09 5.789407e+10
1 China domestic 5.508885e+10 5.789407e+10
2 United States of America inbound 3.205206e+09 5.454173e+10
3 United States of America domestic 5.133652e+10 5.454173e+10
4 India inbound 1.678710e+08 1.894089e+10
... ... ... ... ...
395 Marshall Islands domestic 0.000000e+00 1.524000e+05
396 Niue inbound 1.191000e+05 1.191000e+05
397 Niue domestic 0.000000e+00 1.191000e+05
398 Tuvalu inbound 3.870000e+04 3.870000e+04
399 Tuvalu domestic 0.000000e+00 3.870000e+04

400 rows × 4 columns

In [39]:
#Plotting horizontal stacked bar graph, reversing the order, and setting the y-axis to show the top 15 countries by default
arrivals_stacked_bar = (
    px.bar(arrivals_stacked.sort_values(by = 'total_arrivals', ascending = True),
           x = 'sum', y = 'name', color = 'type',
           title = 'Total Number of Arrivals from 1995 to 2019',
           labels = {'sum': 'Total Number of Arrivals', 'name': 'Country'},
           orientation = 'h')
    .update_layout(yaxis_range=[len(arrivals_stacked.index)/2.0-15.5, len(arrivals_stacked.index)/2.0-0.5])
)
arrivals_stacked_bar

3. What are the countries with the most tourism in every region?¶

Here, I will compare the data across regions and countries. Since this analysis compares quantitative data against categorical data, I will use a treemap, grouped bar graph, categorical scatterplot, side-by-side boxplot, and side-by-side violinplot.
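The treemaps below answer this question visually; the same answer can also be extracted numerically with a groupby, sketched here on toy data (country names, regions, and values are illustrative):

```python
import pandas as pd

# Toy per-country totals with their regions.
region_toy = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                           'region': ['Asia', 'Asia', 'Europe', 'Europe'],
                           'sum': [10.0, 30.0, 20.0, 5.0]})

# idxmax returns the row label of each region's largest total;
# loc then pulls out those top rows.
top_per_region = region_toy.loc[region_toy.groupby('region')['sum'].idxmax()]
```

This yields one row per region: country B for Asia and country C for Europe.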

i. Outbound tourism (departures)¶

In [40]:
outbound_region = outbound.copy()
#Setting name as the index
outbound_region = outbound_region.set_index('name')
#Dropping redundant data
outbound_region = outbound_region[['region', 'sum']]
#Sorting data by sum
outbound_region = outbound_region.sort_values(by = 'sum', ascending = False)
#Dropping data <= 0
outbound_region = outbound_region[outbound_region['sum'] > 0]
#Multiplying numerical data by a million
outbound_region[outbound_region.select_dtypes(include=['number']).columns] *= 1000000
#Resetting the index
outbound_region = outbound_region.reset_index()
outbound_region
Out[40]:
name region sum
0 United States of America Americas 2.781920e+09
1 Mexico Americas 2.636945e+09
2 Germany Europe 2.048565e+09
3 Namibia Africa 1.833099e+09
4 United Kingdom Europe 1.642658e+09
... ... ... ...
127 São Tomé and Príncipe Africa 2.354000e+05
128 Palau Oceania 2.250000e+05
129 Angola Africa 7.500000e+04
130 Tuvalu Oceania 6.050000e+04
131 Niue Oceania 3.320000e+04

132 rows × 3 columns

In [41]:
#Plotting treemap
outbound_tree = (
    px.treemap(outbound_region,
               path=[px.Constant('World'), 'region', 'name'],
               values='sum', color='sum', color_continuous_scale='viridis',
               title = 'Total Number of Outbound Tourism from 1995 to 2019',
               width = 800, height = 800)
    .update_layout(coloraxis_colorbar=dict(title = 'Total Number of Outbound Tourism'))
)
outbound_tree

ii. Inbound tourism (arrivals)¶

In [42]:
inbound_region = inbound.copy()
#Setting name as the index
inbound_region = inbound_region.set_index('name')
#Dropping redundant data
inbound_region = inbound_region[['region', 'sum']]
#Sorting data by sum
inbound_region = inbound_region.sort_values(by = 'sum', ascending = False)
#Dropping data <= 0
inbound_region = inbound_region[inbound_region['sum'] > 0]
#Multiplying numerical data by a million
inbound_region[inbound_region.select_dtypes(include=['number']).columns] *= 1000000
#Resetting the index
inbound_region = inbound_region.reset_index()
inbound_region
Out[42]:
name region sum
0 France Europe 4.001252e+09
1 United States of America Americas 3.205206e+09
2 China Asia 2.805217e+09
3 Mexico Americas 2.306193e+09
4 Spain Europe 2.284187e+09
... ... ... ...
195 São Tomé and Príncipe Africa 3.515000e+05
196 Montserrat Americas 3.033050e+05
197 Marshall Islands Oceania 1.524000e+05
198 Niue Oceania 1.191000e+05
199 Tuvalu Oceania 3.870000e+04

200 rows × 3 columns

In [43]:
#Plotting treemap
inbound_tree = (
    px.treemap(inbound_region,
               path=[px.Constant('World'), 'region', 'name'],
               values='sum', color='sum', color_continuous_scale='viridis',
               title = 'Total Number of Inbound Tourism from 1995 to 2019',
               width = 800, height = 800)
    .update_layout(coloraxis_colorbar=dict(title = 'Total Number of Inbound Tourism'))
)
inbound_tree

iii. Domestic tourism (trips)¶

In [44]:
domestic_region = domestic.copy()
#Setting name as the index
domestic_region = domestic_region.set_index('name')
#Dropping redundant data
domestic_region = domestic_region[['region', 'sum']]
#Sorting data by sum
domestic_region = domestic_region.sort_values(by = 'sum', ascending = False)
#Dropping data <= 0
domestic_region = domestic_region[domestic_region['sum'] > 0]
#Multiplying numerical data by a million
domestic_region[domestic_region.select_dtypes(include=['number']).columns] *= 1000000
#Resetting the index
domestic_region = domestic_region.reset_index()
domestic_region
Out[44]:
name region sum
0 China Asia 5.508885e+10
1 United States of America Americas 5.133652e+10
2 India Asia 1.877302e+10
3 United Kingdom Europe 1.803948e+10
4 Japan Asia 1.711194e+10
... ... ... ...
79 Luxembourg Europe 1.892000e+06
80 Tajikistan Asia 1.666000e+06
81 Madagascar Africa 1.133000e+06
82 Moldova Europe 9.427000e+05
83 Mali Africa 9.085000e+05

84 rows × 3 columns

In [45]:
#Plotting treemap
domestic_tree = (
    px.treemap(domestic_region,
               path=[px.Constant('World'), 'region', 'name'],
               values='sum', color='sum', color_continuous_scale='viridis',
               title = 'Total Number of Domestic Tourism from 1995 to 2019',
               width = 800, height = 800)
    .update_layout(coloraxis_colorbar=dict(title = 'Total Number of Domestic Tourism'))
)
domestic_tree

iv. Arrivals¶

In [46]:
arrivals_region = total_arrivals.copy()
#Setting name as the index
arrivals_region = arrivals_region.set_index('name')
#Dropping redundant data
arrivals_region = arrivals_region[['region', 'total_arrivals']]
#Sorting data by total_arrivals
arrivals_region = arrivals_region.sort_values(by = 'total_arrivals', ascending = False)
#Dropping data <= 0
arrivals_region = arrivals_region[arrivals_region['total_arrivals'] > 0]
#Multiplying numerical data by a million
arrivals_region[arrivals_region.select_dtypes(include=['number']).columns] *= 1000000
#Resetting the index
arrivals_region = arrivals_region.reset_index()
arrivals_region
Out[46]:
name region total_arrivals
0 China Asia 5.789407e+10
1 United States of America Americas 5.454173e+10
2 India Asia 1.894089e+10
3 United Kingdom Europe 1.880950e+10
4 Japan Asia 1.737805e+10
... ... ... ...
195 São Tomé and Príncipe Africa 3.515000e+05
196 Montserrat Americas 3.033050e+05
197 Marshall Islands Oceania 1.524000e+05
198 Niue Oceania 1.191000e+05
199 Tuvalu Oceania 3.870000e+04

200 rows × 3 columns

In [47]:
#Plotting treemap
arrivals_tree = px.treemap(arrivals_region, path=[px.Constant('World'), 'region', 'name'], values='total_arrivals', color='total_arrivals', color_continuous_scale='RdBu', title = 'Total Number of Arrivals from 1995 to 2019', width = 800, height = 800).update_layout(coloraxis_colorbar=dict(title = 'Total Number of Arrivals'))
arrivals_tree

v. Total¶

First, I will merge the outbound, inbound, and domestic tourism data into a single dataframe.

In [48]:
total = outbound.copy()
#Dropping redundant data
total = total[['name', 'country_alpha_2', 'country_alpha_3', 'region', 'latitude', 'longitude', 'units', 'sum']]
#Renaming column
total = total.rename(columns = {'sum': 'outbound'})
#Merging
total = pd.merge(total, inbound[['name', 'sum']], on = 'name')
#Renaming column
total = total.rename(columns = {'sum': 'inbound'})
#Merging
total = pd.merge(total, domestic[['name', 'sum']], on = 'name')
#Renaming column
total = total.rename(columns = {'sum': 'domestic'})
#Summing the total number of tourism
total['total'] = total[['outbound', 'inbound', 'domestic']].sum(axis = 1)
#Sorting the data by total number of tourism
total = total.sort_values(by = 'total', ascending = False).reset_index(drop = True)
total
Out[48]:
name country_alpha_2 country_alpha_3 region latitude longitude units outbound inbound domestic total
0 China CN CHN Asia 35.861660 104.195397 Millions 1456.8230 2805.217000 55088.852 59350.892000
1 United States of America US USA Americas 37.090240 -95.712891 Millions 2781.9205 3205.206252 51336.523 57323.649752
2 United Kingdom GB GBR Europe 55.378051 -3.435973 Millions 1642.6580 770.019000 18039.484 20452.161000
3 India IN IND Asia 20.593684 78.962880 Millions 287.1590 167.871000 18773.016 19228.046000
4 Japan JP JPN Asia 36.204824 138.252924 Millions 422.0640 266.116000 17111.936 17800.116000
... ... ... ... ... ... ... ... ... ... ... ...
200 Liberia LR LBR Africa 6.428055 -9.429499 Millions 0.0000 0.000000 0.000 0.000000
201 Nauru NR NRU Oceania -0.522778 166.931503 Millions 0.0000 0.000000 0.000 0.000000
202 Somalia SO SOM Africa 5.152149 46.199616 Millions 0.0000 0.000000 0.000 0.000000
203 Afghanistan AF AFG Asia 33.939110 67.709953 Millions 0.0000 0.000000 0.000 0.000000
204 Equatorial Guinea GQ GNQ Africa 1.650801 10.267895 Millions 0.0000 0.000000 0.000 0.000000

205 rows × 11 columns
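Note that `pd.merge` defaults to an inner join, so a country missing from any one of the three datasets is silently dropped from `total`. As a hedged aside, `how='outer'` with `indicator=True` makes the dropped rows visible; a sketch on toy data:

```python
import pandas as pd

out = pd.DataFrame({'name': ['China', 'India', 'Niue'], 'outbound': [1, 2, 3]})
dom = pd.DataFrame({'name': ['China', 'India'], 'domestic': [10, 20]})

# indicator=True adds a _merge column marking where each row came from
check = pd.merge(out, dom, on='name', how='outer', indicator=True)
only_one_side = check[check['_merge'] != 'both']['name'].tolist()
print(only_one_side)  # ['Niue']
```

Rows flagged `left_only` or `right_only` are exactly those an inner join would discard.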

In [49]:
total_region = total.copy()
#Setting name as the index
total_region = total_region.set_index('name')
#Dropping redundant data
total_region = total_region[['region', 'total']]
#Sorting data by total
total_region = total_region.sort_values(by = 'total', ascending = False)
#Dropping data <= 0
total_region = total_region[total_region['total'] > 0]
#Multiplying numerical data by a million
total_region[total_region.select_dtypes(include=['number']).columns] *= 1000000
#Resetting the index
total_region = total_region.reset_index()
total_region
Out[49]:
name region total
0 China Asia 5.935089e+10
1 United States of America Americas 5.732365e+10
2 United Kingdom Europe 2.045216e+10
3 India Asia 1.922805e+10
4 Japan Asia 1.780012e+10
... ... ... ...
195 Solomon Islands Oceania 4.074000e+05
196 Montserrat Americas 3.033050e+05
197 Marshall Islands Oceania 1.524000e+05
198 Niue Oceania 1.523000e+05
199 Tuvalu Oceania 9.920000e+04

200 rows × 3 columns

In [50]:
#Plotting treemap
total_tree = px.treemap(total_region, path=[px.Constant('World'), 'region', 'name'], values='total', color='total', color_continuous_scale='RdBu', title = 'Total Number of Tourism from 1995 to 2019', width = 800, height = 800).update_layout(coloraxis_colorbar=dict(title = 'Total Number of Tourism'))
total_tree
In [51]:
#Plotting grouped bar graph
px.bar(total_region, x = 'name', y = 'total', color = 'region', title = 'Total Number of Tourism in Each Region', labels = {'total': 'Total Number of Tourism', 'name': 'Country'})
In [52]:
#Plotting vertical grouped bar graph using only the top 5 countries of each region
px.bar(total_region.groupby('region').head(5), x = 'name', y = 'total', color = 'region', title = 'Top 5 Countries with Highest Total Number of Tourism in Each Region', labels = {'total': 'Total Number of Tourism', 'name': 'Country'})
In [53]:
#Plotting horizontal grouped bar graph using only the top 3 countries of each region and reversing the order
total_group = px.bar(total_region.groupby('region').head(3), x = 'total', y = 'name', color = 'region', title = 'Top 3 Countries with Highest Total Number of Tourism in Each Region', labels = {'total': 'Total Number of Tourism', 'name': 'Country'}, orientation = 'h').update_layout(yaxis = dict(autorange = 'reversed'))
total_group
In [54]:
#Plotting side-by-side boxplot with categorical scatterplot at the side using only the top 5 countries of each region
total_box = px.box(total_region.groupby('region').head(5), y = 'total', x = 'region', color = 'region', hover_data = total_region.columns, points = 'all', title = 'Boxplot of Top 5 Countries with Highest Total Number of Tourism in Each Region', labels = {'total': 'Total Number of Tourism', 'name': 'Country'})
total_box
In [55]:
#Plotting side-by-side violinplot
plt.figure(figsize=(15,10))
sns.violinplot(data = total_region, y = total_region['total'], x = total_region['region'], inner = 'quartile')
plt.show()
In [56]:
#Plotting side-by-side violinplot with categorical scatterplot at the side using only the top 5 countries of each region
total_violin = px.violin(total_region.groupby('region').head(5), y = 'total', x = 'region', color = 'region', box=True, points = 'all', hover_data = total_region.columns, title = 'Violinplot of Top 5 Countries with Highest Total Number of Tourism in Each Region', labels = {'total': 'Total Number of Tourism', 'name': 'Country'})
total_violin

4. Which country has the most tourism overall (e.g. sum of outbound tourism, inbound tourism, and domestic tourism)?¶

As in the previous questions, I will sum the outbound, inbound, and domestic tourism from 1995 to 2019, and use similar figures.

In [57]:
#Creating map
total_map = folium.Map()
#Adding choropleth map
folium.Choropleth(geo_data = 'https://github.com/simonepri/geo-maps/releases/download/v0.6.0/countries-land-10km.geo.json',
                  data = total,
                  columns = ['country_alpha_3', 'total'],
                  key_on = 'feature.properties.A3',
                  fill_color = 'RdBu',
                  fill_opacity = 0.8,
                  line_opacity = 0.2,
                  legend_name = 'Total Number of Tourism from 1995 to 2019 in Millions'
                 ).add_to(total_map)
total_map
Out[57]:
(interactive folium choropleth map; renders only in a trusted, live notebook)
In [58]:
#Adding markers
for i in range(0,len(total)):
    folium.Marker(
        location = [total.iloc[i]['latitude'], total.iloc[i]['longitude']],
        popup = total.iloc[i]['name'] + '\n' + str(int(total.iloc[i]['total'] * 1000000))
    ).add_to(total_map)
total_map
Out[58]:
(interactive folium choropleth map with country markers; renders only in a trusted, live notebook)
In [59]:
#Dropping redundant data
tourism_total = total[['name', 'total']]
#Setting name as index
tourism_total = tourism_total.set_index('name')
#Multiplying by a million
tourism_total *= 1000000
#Dropping data <= 0
tourism_total = tourism_total[tourism_total['total'] > 0]
tourism_total
Out[59]:
total
name
China 5.935089e+10
United States of America 5.732365e+10
United Kingdom 2.045216e+10
India 1.922805e+10
Japan 1.780012e+10
... ...
Solomon Islands 4.074000e+05
Montserrat 3.033050e+05
Marshall Islands 1.524000e+05
Niue 1.523000e+05
Tuvalu 9.920000e+04

200 rows × 1 columns

In [60]:
#Plotting vertical bar graph and setting the x-axis to show the top 15 countries by default
px.bar(tourism_total, title = 'Total Number of Tourism from 1995 to 2019', labels = {'value': 'Total Number of Tourism', 'name': 'Country'}).for_each_trace(lambda t: t.update(name = {'total': 'Total Number of Tourism'}[t.name])).update_layout(xaxis_range=[-0.5, 14.5])
In [61]:
#Plotting horizontal bar graph, reversing the order, and setting the y-axis to show the top 15 countries by default
total_bar = px.bar(tourism_total.sort_values(by = 'total', ascending = True), title = 'Total Number of Tourism from 1995 to 2019', labels = {'value': 'Total Number of Tourism', 'name': 'Country'}, orientation='h').for_each_trace(lambda t: t.update(name = {'total': 'Total Number of Tourism'}[t.name])).update_layout(yaxis_range=[len(tourism_total.index)-15.5, len(tourism_total.index)-0.5])
total_bar
In [62]:
#Dropping redundant data
total_stacked = total[['name', 'outbound', 'inbound', 'domestic', 'total']]
#Setting name and total_arrivals as the index
total_stacked = total_stacked.set_index(['name', 'total'])
#Stack the data and renaming the columns
total_stacked = total_stacked.stack().to_frame().reset_index().rename(columns = {'level_2': 'type', 0: 'sum'})
#Dropping redundant data
total_stacked = total_stacked[['name', 'type', 'sum', 'total']]
#Multiplying numerical data by a million
total_stacked[total_stacked.select_dtypes(include = ['number']).columns] *= 1000000
#Dropping data <= 0
total_stacked = total_stacked[total_stacked['total'] > 0]
total_stacked
Out[62]:
name type sum total
0 China outbound 1.456823e+09 5.935089e+10
1 China inbound 2.805217e+09 5.935089e+10
2 China domestic 5.508885e+10 5.935089e+10
3 United States of America outbound 2.781920e+09 5.732365e+10
4 United States of America inbound 3.205206e+09 5.732365e+10
... ... ... ... ...
595 Niue inbound 1.191000e+05 1.523000e+05
596 Niue domestic 0.000000e+00 1.523000e+05
597 Tuvalu outbound 6.050000e+04 9.920000e+04
598 Tuvalu inbound 3.870000e+04 9.920000e+04
599 Tuvalu domestic 0.000000e+00 9.920000e+04

600 rows × 4 columns
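The `set_index(...).stack()` reshape above can equivalently be written with `pd.melt`, which some find clearer for wide-to-long conversions; a sketch on toy data mimicking one row of `total`:

```python
import pandas as pd

wide = pd.DataFrame({'name': ['China'], 'total': [59350.9],
                     'outbound': [1456.8], 'inbound': [2805.2], 'domestic': [55088.9]})

# melt keeps id_vars as columns and turns the remaining columns into (type, sum) pairs
long = pd.melt(wide, id_vars=['name', 'total'],
               value_vars=['outbound', 'inbound', 'domestic'],
               var_name='type', value_name='sum')
print(long.shape)  # (3, 4)
```

Either way, each country contributes one row per tourism type, which is the shape `px.bar` needs for stacking by `color='type'`.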

In [63]:
#Plotting vertical stacked bar graph and setting the x-axis to show the top 15 countries by default
px.bar(total_stacked, x = 'name', y = 'sum', color = 'type', title = 'Total Number of Tourism from 1995 to 2019', labels = {'value': 'Total Number of Tourism', 'name': 'Country'}).update_layout(xaxis_range=[-0.5, 14.5])

Here, I experimented with an x-axis rangeslider.

In [64]:
#Plotting vertical bar graph, setting an x-axis rangeslider, and setting the x-axis to show the top 15 countries by default
total_stacked_vbar = px.bar(total_stacked, x = 'name', y = 'sum', color = 'type', title = 'Total Number of Tourism from 1995 to 2019', labels = {'sum': 'Total Number of Tourism', 'name': 'Country'}).update_layout(xaxis_rangeslider_visible=True, xaxis_range=[-0.5, 14.5]).update_yaxes(fixedrange = False)
total_stacked_vbar
In [65]:
#Plotting horizontal stacked bar graph, reversing the order, and setting the y-axis to show the top 15 countries by default
total_stacked_hbar = px.bar(total_stacked.sort_values(by = 'total', ascending = True), x = 'sum', y = 'name', color = 'type', title = 'Total Number of Tourism from 1995 to 2019', labels = {'sum': 'Total Number of Tourism', 'name': 'Country'}, orientation = 'h').update_layout(yaxis_range=[len(total_stacked.index)/3.0-15.5, len(total_stacked.index)/3.0-0.5])
total_stacked_hbar

5. How did tourism grow across various countries?¶

i. Outbound tourism (departures)¶

In [66]:
outbound_year = outbound.copy()
outbound_year = outbound_year.drop(['country_alpha_2','country_alpha_3','latitude','longitude','units'], axis = 1)
outbound_year = outbound_year.set_index(['name', 'region'])
outbound_year = outbound_year[np.arange(1995,2020)].stack().to_frame().reset_index()
outbound_year = outbound_year.rename(columns = {'level_2': 'year', 0: 'sum'})
outbound_year['sum'] *= 1000000
outbound_year
Out[66]:
name region year sum
0 United States of America Americas 1995 74031000.0
1 United States of America Americas 1996 76803000.0
2 United States of America Americas 1997 78481000.0
3 United States of America Americas 1998 82758000.0
4 United States of America Americas 1999 84540000.0
... ... ... ... ...
3295 Niue Oceania 2015 1600.0
3296 Niue Oceania 2016 1600.0
3297 Niue Oceania 2017 1600.0
3298 Niue Oceania 2018 1600.0
3299 Niue Oceania 2019 1600.0

3300 rows × 4 columns

In [67]:
top_outbound = outbound.nlargest(10, 'sum')
top_outbound = top_outbound.set_index('name')
top_outbound = top_outbound.drop(['country_alpha_2', 'country_alpha_3', 'region', 'latitude', 'longitude', 'units', 'sum'],axis=1)
top_outbound.columns.name = 'year'
top_outbound = top_outbound.T
top_outbound[top_outbound.select_dtypes(include=['number']).columns] *= 1000000
top_outbound
Out[67]:
name United States of America Mexico Germany Namibia United Kingdom China Italy Poland Canada Russia
year
1995 74031000.0 103161000.0 55800000.0 47594000.0 41345000.0 4520000.0 18173000.0 36387000.0 18206000.0 21329000.0
1996 76803000.0 103442000.0 55800000.0 47594000.0 42050000.0 5061000.0 18173000.0 44713000.0 18973000.0 12260000.0
1997 78481000.0 107242000.0 55800000.0 47594000.0 45957000.0 5324000.0 40196000.0 48610000.0 19111000.0 11182000.0
1998 82758000.0 107927000.0 69200000.0 47594000.0 50872000.0 8426000.0 42431000.0 49328000.0 17648000.0 10635000.0
1999 84540000.0 117383000.0 78100000.0 53144000.0 53881000.0 9232000.0 42390000.0 55097000.0 18368000.0 12631000.0
2000 87973000.0 127268000.0 80507000.0 58901000.0 56837000.0 10473000.0 44628000.0 56677000.0 19182000.0 18371000.0
2001 84755000.0 123732000.0 81551000.0 61096000.0 58281000.0 12133000.0 43611000.0 53122000.0 18359000.0 18030000.0
2002 80883000.0 124633000.0 80393000.0 64540000.0 59377000.0 16602000.0 44660000.0 45043000.0 17705000.0 20428000.0
2003 75880000.0 123015000.0 85345000.0 60936000.0 61424000.0 20222000.0 46357000.0 38730000.0 17739000.0 20572000.0
2004 79655000.0 128903000.0 84859000.0 68903000.0 64194000.0 28853000.0 40400000.0 37226000.0 19595000.0 24507000.0
2005 79215000.0 128392000.0 86622000.0 72300000.0 66494000.0 31026000.0 43407000.0 40841000.0 21099000.0 28416000.0
2006 148511000.0 122022000.0 81801000.0 75812000.0 69536000.0 34524000.0 46369000.0 44696000.0 46912000.0 29107000.0
2007 140364000.0 109540000.0 82099000.0 80682000.0 69450000.0 40954000.0 49166000.0 47561000.0 50044000.0 34285000.0
2008 136148000.0 107519000.0 86201000.0 81911000.0 69011000.0 45844000.0 54421000.0 50243000.0 51737000.0 36538000.0
2009 129954000.0 98228000.0 85547000.0 81958000.0 63513000.0 47656000.0 54839000.0 39270000.0 47481000.0 34276000.0
2010 121574000.0 91658000.0 85872000.0 84442000.0 64647000.0 57386000.0 55304000.0 42760000.0 53620000.0 39323000.0
2011 114089000.0 88113000.0 84692000.0 84816000.0 67493000.0 70250000.0 52617000.0 43270000.0 61909000.0 43726000.0
2012 116329000.0 87332000.0 82729000.0 85276000.0 66858000.0 83182000.0 53338000.0 48290000.0 65175000.0 47813000.0
2013 118968000.0 90787000.0 87459000.0 84414000.0 68959000.0 98185000.0 52633000.0 52580000.0 65780000.0 54069000.0
2014 121699000.0 90982000.0 83008000.0 84519000.0 72204000.0 116593000.0 55169000.0 35400000.0 63737000.0 45889000.0
2015 130364000.0 94988000.0 83737000.0 89082000.0 77619000.0 127860000.0 57418000.0 44300000.0 55971000.0 34550000.0
2016 141526000.0 97372000.0 90966000.0 91758000.0 81757000.0 135130000.0 57480000.0 44500000.0 52979000.0 31659000.0
2017 148045000.0 94274000.0 92402000.0 91304000.0 87242000.0 143035000.0 60042000.0 46700000.0 54955000.0 39629000.0
2018 158445400.0 86280000.0 108542000.0 92214000.0 90571000.0 149720000.0 61194600.0 48600000.0 38069000.0 41964000.0
2019 170930100.0 82752000.0 99533000.0 94715000.0 93086000.0 154632000.0 62207000.0 50600000.0 37845900.0 45330000.0
In [68]:
#Plotting line graph
outbound_growth = px.line(top_outbound, title = 'Number of Outbound Tourism of the Top 10 Countries over the Years', labels = {'value': 'Number of Outbound Tourism', 'year': 'Year'}, markers = True).update_layout(legend_title = 'Countries')
outbound_growth

ii. Inbound tourism (arrivals)¶

In [69]:
inbound_year = inbound.copy()
inbound_year = inbound_year.drop(['country_alpha_2','country_alpha_3','latitude','longitude','units'], axis = 1)
inbound_year = inbound_year.set_index(['name', 'region'])
inbound_year = inbound_year[np.arange(1995,2020)].stack().to_frame().reset_index()
inbound_year = inbound_year.rename(columns = {'level_2': 'year', 0: 'sum'})
inbound_year['sum'] *= 1000000
inbound_year
Out[69]:
name region year sum
0 France Europe 1995 60033000.0
1 France Europe 1996 148263000.0
2 France Europe 1997 157551000.0
3 France Europe 1998 70109000.0
4 France Europe 1999 73147000.0
... ... ... ... ...
4995 Tuvalu Oceania 2015 2400.0
4996 Tuvalu Oceania 2016 2500.0
4997 Tuvalu Oceania 2017 2500.0
4998 Tuvalu Oceania 2018 3100.0
4999 Tuvalu Oceania 2019 3700.0

5000 rows × 4 columns

In [70]:
#Plotting line graph
px.line(inbound_year, x = 'year', y = 'sum', color='region', line_group='name', title = 'Total Number of Inbound Tourism from 1995 to 2019', labels = {'sum': 'Number of Inbound Tourism', 'year': 'Year'}, markers = True).update_layout(legend_title = 'Region')
In [71]:
#Plotting area graph
px.area(inbound_year, x = 'year', y = 'sum', color='region', line_group='name', title = 'Total Number of Inbound Tourism from 1995 to 2019', labels = {'sum': 'Number of Inbound Tourism', 'year': 'Year'}).update_layout(legend_title = 'Region')
In [72]:
top_inbound = inbound.nlargest(10, 'sum')
top_inbound = top_inbound.set_index('name')
top_inbound = top_inbound.drop(['country_alpha_2', 'country_alpha_3', 'region', 'latitude', 'longitude', 'units', 'sum'],axis=1)
top_inbound.columns.name = 'year'
top_inbound = top_inbound.T
top_inbound[top_inbound.select_dtypes(include=['number']).columns] *= 1000000
top_inbound
Out[72]:
name France United States of America China Mexico Spain Poland Italy Croatia Hungary Canada
year
1995 60033000.0 79732000.0 46387000.0 85446000.0 52460000.0 82244000.0 55706000.0 16100000.0 39240000.0 41657000.0
1996 148263000.0 82756000.0 51128000.0 90394000.0 55077000.0 87439000.0 59805000.0 19085000.0 39833000.0 43256000.0
1997 157551000.0 82525000.0 57588000.0 92915000.0 62415000.0 87817000.0 57998000.0 23660000.0 37315000.0 45076000.0
1998 70109000.0 74767000.0 63478000.0 95214000.0 68068000.0 88592000.0 58499000.0 25499000.0 33624000.0 48064000.0
1999 73147000.0 75796000.0 72796000.0 99869000.0 72040000.0 89118000.0 59521000.0 29215000.0 28803000.0 49055000.0
2000 77190000.0 78343000.0 83444000.0 105673000.0 74580000.0 84515000.0 62702000.0 37226000.0 31141000.0 48638000.0
2001 75202000.0 70975000.0 89013000.0 100718000.0 75564000.0 61431000.0 60960000.0 40129000.0 30679000.0 47147000.0
2002 77012000.0 64434000.0 97908000.0 100153000.0 79313000.0 50735000.0 63561000.0 41737000.0 31739000.0 44896000.0
2003 75048000.0 62082000.0 91662000.0 92330000.0 82326000.0 52130000.0 63026000.0 42857000.0 31412000.0 38903000.0
2004 190282000.0 67606000.0 109038000.0 99250000.0 85981000.0 61918000.0 58480000.0 44974000.0 33934000.0 38845000.0
2005 185829000.0 71484000.0 120292000.0 103146000.0 92563000.0 64606000.0 59230000.0 45762000.0 36173000.0 36160000.0
2006 193882000.0 183178000.0 124942000.0 97701000.0 96152000.0 65115000.0 66353000.0 47733000.0 38318000.0 33390000.0
2007 193319000.0 175299000.0 131873000.0 93582000.0 98907000.0 66208000.0 70271000.0 52271000.0 39379000.0 30373000.0
2008 193571000.0 175703000.0 130027000.0 92948000.0 97670000.0 59935000.0 70719000.0 51336000.0 39554000.0 27370000.0
2009 192369000.0 160508000.0 126476000.0 88044000.0 91899000.0 53840000.0 71692000.0 47573000.0 40624000.0 24696000.0
2010 189826000.0 162275000.0 133762000.0 81953000.0 93744000.0 58340000.0 73225000.0 49006000.0 39904000.0 25621000.0
2011 196595000.0 147271416.0 135423000.0 75732000.0 99187000.0 60745000.0 75866000.0 49969000.0 41304000.0 25066000.0
2012 197522000.0 171629897.0 132405000.0 76749000.0 98128000.0 67390000.0 76293000.0 47185000.0 43565000.0 25318000.0
2013 204410000.0 179309907.0 129078000.0 78100000.0 103231000.0 72310000.0 76762000.0 48345000.0 43611000.0 25167000.0
2014 206599000.0 178311354.0 128499000.0 81042000.0 107144000.0 73750000.0 77694000.0 51168000.0 45984000.0 25558000.0
2015 203302000.0 176864526.0 133820000.0 87129000.0 109834000.0 77743000.0 81068000.0 55858000.0 48345000.0 27555000.0
2016 203042000.0 175261488.0 141774000.0 94853000.0 115561000.0 80476000.0 84925000.0 57587000.0 52890000.0 30142000.0
2017 207274000.0 174291746.0 153260000.0 99349000.0 121717000.0 83804000.0 89931000.0 59238000.0 54962000.0 31081000.0
2018 211998000.0 169324918.0 158606000.0 96497000.0 124456000.0 85946000.0 93228600.0 57668000.0 57667000.0 31274000.0
2019 217877000.0 165478000.0 162538000.0 97406000.0 126170000.0 88515000.0 95399000.0 60021000.0 61397000.0 32430000.0

I initially used seaborn but preferred plotly for its interactivity.

In [73]:
plt.figure(figsize=(15,5), dpi=300)
sns.lineplot(data = top_inbound, dashes = False)
plt.legend(bbox_to_anchor=(1.01, 1.05))
plt.title('Number of Inbound Tourism of the Top 10 Countries over the Years')
plt.xlabel('Year')
plt.ylabel('Number of Inbound Tourism')
plt.show()
In [74]:
#Plotting line graph
inbound_growth = px.line(top_inbound, title = 'Number of Inbound Tourism of the Top 10 Countries over the Years', labels = {'value': 'Number of Inbound Tourism', 'year': 'Year'}, markers = True).update_layout(legend_title = 'Countries')
inbound_growth

iii. Domestic tourism (trips)¶

In [75]:
domestic_year = domestic.copy()
domestic_year = domestic_year.drop(['country_alpha_2','country_alpha_3','latitude','longitude','units'], axis = 1)
domestic_year = domestic_year.set_index(['name', 'region'])
domestic_year = domestic_year[np.arange(1995,2020)].stack().to_frame().reset_index()
domestic_year = domestic_year.rename(columns = {'level_2': 'year', 0: 'sum'})
domestic_year['sum'] *= 1000000
domestic_year
Out[75]:
name region year sum
0 China Asia 1995 629000000.0
1 China Asia 1996 640000000.0
2 China Asia 1997 644000000.0
3 China Asia 1998 695000000.0
4 China Asia 1999 719000000.0
... ... ... ... ...
2095 Mali Africa 2015 31000.0
2096 Mali Africa 2016 26000.0
2097 Mali Africa 2017 24000.0
2098 Mali Africa 2018 24500.0
2099 Mali Africa 2019 23000.0

2100 rows × 4 columns

In [76]:
top_domestic = domestic.nlargest(10, 'sum')
top_domestic = top_domestic.set_index('name')
top_domestic = top_domestic.drop(['country_alpha_2', 'country_alpha_3', 'region', 'latitude', 'longitude', 'units', 'sum'],axis=1)
top_domestic.columns.name = 'year'
top_domestic = top_domestic.T
top_domestic[top_domestic.select_dtypes(include=['number']).columns] *= 1000000
top_domestic
Out[76]:
name China United States of America India United Kingdom Japan Spain Canada Indonesia France Australia
year
1995 6.290000e+08 2.004500e+09 1.366440e+08 1.260100e+08 734558000.0 1.154560e+08 208165000.0 200589000.0 210755000.0 243920000.0
1996 6.400000e+08 2.004500e+09 1.401200e+08 1.260100e+08 734558000.0 1.154560e+08 208165000.0 200589000.0 210755000.0 243920000.0
1997 6.440000e+08 2.004500e+09 1.598770e+08 1.260100e+08 734558000.0 1.154560e+08 208165000.0 200589000.0 210755000.0 243920000.0
1998 6.950000e+08 2.004500e+09 1.681960e+08 1.260100e+08 734558000.0 1.154560e+08 208165000.0 200589000.0 210755000.0 243920000.0
1999 7.190000e+08 2.004500e+09 1.906710e+08 1.260100e+08 734558000.0 1.154560e+08 208165000.0 200589000.0 210755000.0 243920000.0
2000 7.440000e+08 2.004500e+09 2.201070e+08 1.260100e+08 734558000.0 3.710172e+08 208165000.0 200589000.0 210755000.0 235235000.0
2001 7.840000e+08 2.004500e+09 2.364700e+08 1.260100e+08 734558000.0 4.052927e+08 208165000.0 200589000.0 210755000.0 220593000.0
2002 8.780000e+08 2.004500e+09 2.695980e+08 1.260100e+08 734558000.0 4.029999e+08 208165000.0 200589000.0 210755000.0 217472000.0
2003 8.700000e+08 2.004500e+09 3.090380e+08 1.260100e+08 734558000.0 4.326621e+08 208165000.0 200589000.0 210755000.0 212681000.0
2004 1.102000e+09 2.004500e+09 3.662680e+08 1.260100e+08 734558000.0 4.156371e+08 208165000.0 200589000.0 210755000.0 203869000.0
2005 1.212000e+09 2.004500e+09 3.920140e+08 1.260100e+08 734558000.0 1.570054e+08 208165000.0 200589000.0 210755000.0 200044000.0
2006 1.394000e+09 2.004500e+09 4.623210e+08 1.260100e+08 734558000.0 1.549684e+08 208165000.0 204553000.0 284337000.0 208028000.0
2007 1.610000e+09 2.004500e+09 5.265640e+08 1.260100e+08 734558000.0 3.555545e+08 214559000.0 222389000.0 288609000.0 223980000.0
2008 1.712000e+09 1.964900e+09 5.630340e+08 1.260100e+08 734558000.0 3.929012e+08 214498000.0 225041000.0 278950000.0 210754000.0
2009 1.902000e+09 1.900100e+09 6.688000e+08 1.260100e+08 702896000.0 3.652531e+08 227121000.0 229730000.0 278275000.0 215846000.0
2010 2.103000e+09 1.963700e+09 7.477000e+08 1.194340e+08 631596000.0 3.544248e+08 229158000.0 234377000.0 268041000.0 225239000.0
2011 2.641000e+09 1.998500e+09 8.645330e+08 1.668640e+09 612525000.0 3.476951e+08 317021000.0 236751000.0 276752000.0 233127000.0
2012 2.957000e+09 2.030300e+09 1.045050e+09 1.836020e+09 612750000.0 3.728110e+08 316254000.0 245290000.0 268673000.0 248377000.0
2013 3.262000e+09 2.059600e+09 1.142529e+09 1.710905e+09 630950000.0 3.984230e+08 320266300.0 250036000.0 265182000.0 240118000.0
2014 3.611000e+09 2.109300e+09 1.282802e+09 1.698942e+09 595221000.0 4.627610e+08 318208700.0 251237000.0 266027000.0 260362000.0
2015 3.990000e+09 2.178700e+09 1.431974e+09 1.649626e+09 604715000.0 3.722650e+08 315745700.0 256419000.0 256078000.0 269481000.0
2016 4.435000e+09 2.206500e+09 1.615389e+09 1.953655e+09 641079000.0 3.971340e+08 319315000.0 264338000.0 255498000.0 280325000.0
2017 5.010000e+09 2.248700e+09 1.657546e+09 1.914076e+09 647510000.0 4.483050e+08 325808200.0 270822000.0 276537000.0 291797000.0
2018 5.539000e+09 2.291100e+09 1.853788e+09 1.821956e+09 561779000.0 4.464790e+08 278060000.0 303403000.0 268152000.0 310166000.0
2019 6.005852e+09 2.326623e+09 2.321983e+09 1.776080e+09 587103000.0 4.235720e+08 275418000.0 722159000.0 260522000.0 365797000.0
In [77]:
#Plotting line graph
domestic_growth = px.line(top_domestic, title = 'Number of Domestic Tourism of the Top 10 Countries over the Years', labels = {'value': 'Number of Domestic Tourism', 'year': 'Year'}, markers = True).update_layout(legend_title = 'Countries')
domestic_growth

6. Are there any correlations between tourism of countries / regions and other variables (heritage, interest over time, etc.)?¶

i. Rankings of Countries¶

Here I compare the 2019 country rankings with the ranking of countries by inbound tourism in 2019.

In [78]:
inbound_19 = inbound[['name',2019]].dropna().sort_values(by = 2019, ascending = False).reset_index(drop = True).reset_index().drop(2019, axis = 1)
inbound_19 = inbound_19.rename(columns = {'index': 'Inbound Tourism'})
inbound_19['Inbound Tourism'] += 1
inbound_19 = pd.merge(ranking_19, inbound_19, left_on = 'Country', right_on = 'name', how = 'inner').drop('name', axis = 1)
inbound_19
Out[78]:
Overall Rank Country Entrepreneurship Adventure Citizenship Cultural Influence Heritage Movers Open for Business Power Quality of Life Inbound Tourism
0 1 Switzerland 4 17 3 9 27 25 2 14 5 41
1 2 Japan 1 39 17 6 10 5 22 7 13 21
2 3 Canada 6 19 2 12 42 39 7 12 1 19
3 4 Germany 2 57 12 11 20 34 21 4 10 14
4 5 United Kingdom 5 40 11 5 12 53 23 5 12 12
... ... ... ... ... ... ... ... ... ... ... ... ...
68 71 Jordan 62 71 74 70 48 50 63 33 77 66
69 72 Tunisia 69 60 76 65 53 63 55 63 68 45
70 73 Belarus 56 61 50 71 67 66 76 35 67 40
71 74 Nigeria 67 74 77 63 76 57 58 46 74 68
72 75 Pakistan 68 77 78 79 71 56 72 22 73 132

73 rows × 12 columns

First, I will plot a pairplot to see the correlation between the various rankings and the ranking of inbound tourism.

In [79]:
inbound_ranking = sns.pairplot(inbound_19, y_vars=['Inbound Tourism']).fig

We can see that Heritage and Cultural Influence have the strongest positive correlation with inbound tourism, so I will examine them further.
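The pairplot gives only a visual impression; the same comparison can be confirmed numerically by correlating every ranking column against the inbound-tourism rank. A sketch on toy data standing in for `inbound_19` (column names illustrative):

```python
import pandas as pd

# Toy stand-in: two ranking columns plus the inbound-tourism rank
df = pd.DataFrame({'Heritage': [1, 2, 3, 4],
                   'Cultural Influence': [2, 1, 4, 3],
                   'Inbound Tourism': [1, 2, 3, 4]})

# corr() against one column orders the predictors by linear association
corrs = (df.corr()['Inbound Tourism']
           .drop('Inbound Tourism')
           .sort_values(ascending=False))
print(corrs.index[0])  # Heritage (perfectly aligned with the toy target)
```

On the real frame, `inbound_19.corr()['Inbound Tourism']` would rank all nine predictors at once instead of eyeballing nine scatterplots.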

In [80]:
sns.regplot(data = inbound_19, x = 'Heritage', y = 'Inbound Tourism')
plt.show()
In [81]:
sns.jointplot(data = inbound_19, x = 'Heritage', y = 'Inbound Tourism')
plt.show()
In [82]:
sns.jointplot(data = inbound_19, x = 'Heritage', y = 'Inbound Tourism', kind = 'hex')
plt.show()
In [83]:
sns.jointplot(data = inbound_19, x = 'Heritage', y = 'Inbound Tourism', kind = 'kde')
plt.show()
In [84]:
inbound_heritage_contour = px.density_contour(inbound_19, x = 'Heritage', y = 'Inbound Tourism', title = '2D Histogram Contour Plot between the Rankings for Inbound Tourism and Heritage of Countries in 2019').update_traces(contours_coloring="fill", contours_showlabels = True)
inbound_heritage_contour
In [85]:
inbound_heritage = px.scatter(inbound_19, x = 'Heritage', y = 'Inbound Tourism', trendline='ols', trendline_color_override='darkblue', title = 'The Correlation between the Rankings for Inbound Tourism and Heritage of Countries in 2019')
inbound_heritage
In [86]:
heritage_coef, heritage_value = stats.pearsonr(inbound_19['Heritage'], inbound_19['Inbound Tourism'])
heritage_coef, heritage_value
Out[86]:
(0.6676240739225271, 1.1103724933546565e-10)
In [87]:
inbound_culture_contour = px.density_contour(inbound_19, x = 'Cultural Influence', y = 'Inbound Tourism', title = '2D Histogram Contour Plot between the Rankings for Inbound Tourism and Cultural Influence of Countries in 2019').update_traces(contours_coloring="fill", contours_showlabels = True)
inbound_culture_contour
In [88]:
inbound_culture = px.scatter(inbound_19, x = 'Cultural Influence', y = 'Inbound Tourism', trendline='ols', trendline_color_override='darkblue', title = 'The Correlation between the Rankings for Inbound Tourism and Cultural Influence of Countries in 2019')
inbound_culture
In [89]:
culture_coef, culture_value = stats.pearsonr(inbound_19['Cultural Influence'], inbound_19['Inbound Tourism'])
culture_coef, culture_value
Out[89]:
(0.4942019692089541, 8.846681908836526e-06)
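Since both variables here are already ranks, Spearman's rank correlation is arguably the more natural measure than Pearson's. A hedged sketch with `scipy.stats.spearmanr` on toy rank data:

```python
from scipy import stats

heritage_rank = [1, 2, 3, 4, 5]
inbound_rank = [2, 1, 3, 5, 4]

# spearmanr correlates the ranks; for data that are already ranks it
# coincides with Pearson computed on those ranks
rho, p_value = stats.spearmanr(heritage_rank, inbound_rank)
print(round(rho, 2))  # 0.8
```

For pre-ranked columns like these, `pearsonr` and `spearmanr` give the same coefficient, so the conclusions above are unaffected either way.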

Since Heritage has the strongest correlation, I will use it to fit a linear regression model.

In [90]:
lm = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(inbound_19[['Heritage']], inbound_19['Inbound Tourism'], test_size=0.2, random_state=0)
lm.fit(X_train, y_train)
yhat = lm.predict(X_test)
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat,  color='b', label='Fitted Value')
plt.legend()
Out[90]:
<matplotlib.legend.Legend at 0x1e9dead3970>
In [91]:
lm.score(X_train, y_train)
Out[91]:
0.44387460785374466

The model is decent but overpredicts near the middle.
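Note that `lm.score(X_train, y_train)` reports R² on the training data; a fairer check scores the held-out test set. A minimal sketch of the pattern on synthetic data (names and coefficients are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 80, size=(73, 1))            # stand-in for the Heritage ranks
y = 0.7 * X[:, 0] + rng.normal(0, 5, size=73)   # noisy linear response

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# Test-set R-squared is the honest estimate of generalisation
test_r2 = model.score(X_te, y_te)
```

The same one-liner, `lm.score(X_test, y_test)`, would give the test-set R² for the model above.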

Now, I will use all the variables to form a multiple linear regression model.

In [92]:
mlm = LinearRegression()
X_train2, X_test2, y_train2, y_test2 = train_test_split(inbound_19[['Entrepreneurship','Adventure','Citizenship','Cultural Influence','Heritage','Movers','Open for Business','Power','Quality of Life']], inbound_19['Inbound Tourism'], test_size=0.2, random_state=0)
mlm.fit(X_train2, y_train2)
yhat2 = mlm.predict(X_test2)

sns.kdeplot(y_test2, color='r', label='Actual Value')
sns.kdeplot(yhat2,  color='b', label='Fitted Value')
plt.legend()
Out[92]:
<matplotlib.legend.Legend at 0x1e9de7c9a60>
In [93]:
mlm.score(X_train2, y_train2)
Out[93]:
0.5532718300044949

Compared to the single-variable model, this one fits better, but it still misses a dip in the middle of the distribution.
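With only 73 countries, a single 80/20 split is noisy; k-fold cross-validation averages R² over several splits and gives a more stable estimate. A hedged sketch on synthetic data of the same shape (coefficients illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(0, 80, size=(73, 3))            # stand-in for a few ranking columns
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 5, size=73)

# 5-fold CV returns one R-squared per held-out fold; the mean is more stable
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
mean_r2 = scores.mean()
```

Applying `cross_val_score` to `mlm` with the nine ranking columns would show whether the improvement over the single-variable model holds up out of sample.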

ii. Interest over time¶

Aside from that, I will look for a correlation between inbound tourism and interest over time, using interest-over-time data obtained from Google Trends.

In [94]:
year = countries_interest.reset_index()
year.Date = year.Date.astype('string').str[:4].astype('int')
year = year.groupby('Date').sum()
year = year.loc[np.arange(2004,2020)]
year = year.T.reset_index()
year.columns.name = ''
year[np.arange(2004,2020)] = year[np.arange(2004,2020)].div(year[np.arange(2004,2020)].max(axis = 1),axis = 0) * 100
year
Out[94]:
index 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 Afghanistan 73.684211 61.403509 61.403509 55.263158 61.403509 91.228070 100.000000 84.210526 80.701754 64.912281 59.649123 54.385965 56.140351 57.017544 64.912281 92.105263
1 Åland Islands 100.000000 16.358025 85.493827 24.382716 20.370370 32.716049 30.864198 37.037037 35.802469 45.370370 31.790123 45.679012 46.913580 45.679012 48.148148 49.691358
2 Albania 61.878453 61.049724 54.972376 59.944751 70.441989 77.348066 74.033149 74.033149 70.994475 70.441989 80.662983 78.729282 96.685083 88.397790 85.911602 100.000000
3 American Samoa 99.076923 84.615385 68.615385 62.769231 63.692308 100.000000 66.769231 53.846154 57.846154 45.230769 47.692308 53.230769 62.461538 50.461538 52.923077 56.923077
4 Andorra 100.000000 90.643275 76.842105 65.964912 56.374269 52.280702 46.432749 40.701754 39.064327 39.532164 36.959064 40.116959 37.076023 40.818713 39.766082 50.643275
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
222 United Kingdom of Great Britain and Northern I... 33.590734 93.050193 71.235521 76.254826 75.289575 82.625483 79.343629 67.181467 86.872587 91.505792 89.189189 86.293436 92.084942 94.594595 100.000000 88.610039
223 United States of America 100.000000 76.660342 56.925996 55.977230 57.495256 44.402277 39.848197 34.345351 32.637571 29.222011 29.222011 30.740038 27.514231 27.134725 43.453510 24.667932
224 United States Minor Outlying Islands 93.750000 41.666667 100.000000 71.875000 55.208333 68.229167 58.333333 53.645833 51.041667 47.395833 58.854167 55.729167 64.583333 60.937500 65.625000 83.333333
225 Viet Nam 15.515409 15.515409 14.877790 19.659936 34.112646 71.307120 96.811902 100.000000 91.710946 80.340064 86.078640 80.871413 70.350691 51.115834 47.608927 39.319872
226 Zimbabwe 77.339901 75.615764 61.330049 63.793103 85.221675 62.068966 61.330049 69.704433 70.197044 84.729064 80.788177 100.000000 96.551724 99.753695 94.581281 80.788177

227 rows × 17 columns
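The normalisation idiom used above, `df.div(df.max(axis=1), axis=0) * 100`, rescales each row so that its own maximum becomes 100, making countries of very different sizes comparable. In isolation, with toy data (the country names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({2018: [50.0, 8.0], 2019: [200.0, 10.0]},
                  index=['Country A', 'Country B'])

# Divide every row by that row's maximum, then scale to 0-100
scaled = df.div(df.max(axis=1), axis=0) * 100
print(scaled)
```

After scaling, each country's peak year reads 100 and the other years are percentages of that peak.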

In [96]:
inbound_interest = inbound.copy()
inbound_interest[np.arange(2004,2020)] = inbound_interest[np.arange(2004,2020)].div(inbound_interest[np.arange(2004,2020)].max(axis=1), axis=0) * 100
inbound_interest = inbound_interest.drop(['country_alpha_2', 'country_alpha_3', 'region', 'latitude', 'longitude', 'units', 'sum', 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003], axis = 1)
inbound_interest = inbound_interest.dropna()
inbound_interest
Out[96]:
name 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 France 87.334597 85.290783 88.986905 88.728503 88.844164 88.292477 87.125305 90.232103 90.657573 93.818990 94.823685 93.310446 93.191112 95.133493 97.301689 100.000000
1 United States of America 36.907271 39.024337 100.000000 95.698719 95.919270 87.624060 88.588695 80.397982 93.695693 97.888342 97.343215 96.553367 95.678241 95.148842 92.437366 90.337268
2 China 67.084620 74.008540 76.869409 81.133643 79.997908 77.813188 82.295832 83.317747 81.460951 79.414045 79.057820 82.331516 87.225141 94.291796 97.580873 100.000000
3 Mexico 96.222830 100.000000 94.721075 90.727706 90.113044 85.358618 79.453396 73.422139 74.408121 75.717914 78.570182 84.471526 91.959940 96.318810 93.553798 94.435073
4 Spain 68.146945 73.363716 76.208290 78.391852 77.411429 72.837442 74.299754 78.613775 77.774431 81.818974 84.920346 87.052390 91.591504 96.470635 98.641515 100.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
195 São Tomé and Príncipe 30.372493 45.272206 35.243553 33.810888 41.547278 43.553009 22.922636 34.957020 34.957020 34.957020 34.957020 73.352436 82.808023 82.808023 95.702006 100.000000
196 Montserrat 78.756477 67.875648 49.222798 45.077720 43.523316 37.823834 39.896373 38.341969 51.321244 45.077720 55.440415 68.911917 69.948187 96.373057 86.528497 100.000000
197 Marshall Islands 97.826087 100.000000 63.043478 78.260870 65.217391 58.695652 50.000000 50.000000 50.000000 50.000000 53.260870 68.478261 70.652174 85.869565 73.913043 66.304348
198 Niue 22.413793 24.137931 25.862069 30.172414 40.517241 40.517241 53.448276 52.586207 43.103448 60.344828 63.793103 66.379310 76.724138 100.000000 90.517241 87.931034
199 Tuvalu 35.135135 29.729730 29.729730 29.729730 45.945946 43.243243 45.945946 32.432432 29.729730 35.135135 37.837838 64.864865 67.567568 67.567568 83.783784 100.000000

200 rows × 17 columns

Here is a small multiples plot of inbound tourism and interest over time. I did not give the figure an overall title, as it would leave white space at the top.

In [97]:
nrow = 40
ncol = 5

interest = plt.figure(figsize = (15,80))
gs = interest.add_gridspec(nrow, ncol, hspace = 0.5, wspace = 0)
axes = gs.subplots(sharex = True, sharey = True)

named = inbound_interest.set_index('name')
count = 0
for r in range(nrow):
    for c in range(ncol):
        country = named.iloc[count]
        country.plot(title = country.name, ax = axes[r,c], ylim = (0,100))
        try:
            # Overlay the interest-over-time series, if this country has one
            year.set_index('index').loc[country.name].plot(ax = axes[r,c], ylim = (0,100))
        except KeyError:
            pass
        count += 1

plt.show()

I chose three countries at random to test for correlation: France, Albania, and China.

In [98]:
inbound_interest[inbound_interest.name == 'France'][np.arange(2004,2020)].T.rename(columns = {0: 'inbound tourism'}).plot()
year[year['index'] == 'France'].drop('index', axis= 1).T.rename(columns = {70: 'interest over time'}).plot()
Out[98]:
<AxesSubplot:>
In [99]:
france_interest = pd.merge(inbound_interest[inbound_interest.name == 'France'][np.arange(2004,2020)].T.rename(columns = {0: 'inbound tourism'}), year[year['index'] == 'France'].drop('index', axis= 1).T.rename(columns = {70: 'interest over time'}), left_index=True, right_index=True)
france_interest
Out[99]:
inbound tourism interest over time
2004 87.334597 100.000000
2005 85.290783 99.184783
2006 88.986905 95.923913
2007 88.728503 93.206522
2008 88.844164 87.092391
2009 88.292477 93.342391
2010 87.125305 92.119565
2011 90.232103 75.271739
2012 90.657573 66.983696
2013 93.818990 63.994565
2014 94.823685 61.413043
2015 93.310446 57.336957
2016 93.191112 51.358696
2017 95.133493 50.000000
2018 97.301689 54.483696
2019 100.000000 50.407609
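The `pd.merge(..., left_index=True, right_index=True)` call above aligns the two yearly series on their shared year index. A minimal sketch of the idiom (the numbers are illustrative):

```python
import pandas as pd

inbound = pd.DataFrame({'inbound tourism': [87.3, 85.3, 89.0]},
                       index=[2004, 2005, 2006])
interest = pd.DataFrame({'interest over time': [100.0, 99.2, 95.9]},
                        index=[2004, 2005, 2006])

# Inner-join the two series on the shared year index
merged = pd.merge(inbound, interest, left_index=True, right_index=True)
print(merged)
```

Only years present in both frames survive the join, which is what makes the side-by-side comparison valid.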
In [100]:
inbound_interest[inbound_interest.name == 'China'][np.arange(2004,2020)].T.rename(columns = {2: 'inbound tourism'}).plot()
year[year['index'] == 'China'].drop('index', axis= 1).T.rename(columns = {39: 'interest over time'}).plot()
Out[100]:
<AxesSubplot:>
In [101]:
france_interest_over_time = px.line(france_interest, title = 'Percentage of Inbound Tourism and Interest Over Time of France', labels = {'value': 'Percentage', 'index': 'Year'}, markers = True)
france_interest_over_time
In [102]:
albania_interest = pd.merge(inbound_interest[inbound_interest.name == 'Albania'][np.arange(2004,2020)].T.rename(columns = {79: 'inbound tourism'}), year[year['index'] == 'Albania'].drop('index', axis= 1).T.rename(columns = {2: 'interest over time'}), left_index=True, right_index=True)
albania_interest
Out[102]:
inbound tourism interest over time
2004 10.068686 61.878453
2005 11.676553 61.049724
2006 14.626912 54.972376
2007 17.592882 59.944751
2008 22.166719 70.441989
2009 28.972838 77.348066
2010 37.730253 74.033149
2011 45.769591 74.033149
2012 54.854824 70.994475
2013 50.827349 70.441989
2014 57.336872 80.662983
2015 64.486419 78.729282
2016 73.930690 96.685083
2017 79.893850 88.397790
2018 92.522635 85.911602
2019 100.000000 100.000000
In [103]:
albania_interest_over_time = px.line(albania_interest, title = 'Percentage of Inbound Tourism and Interest Over Time of Albania', labels = {'value': 'Percentage', 'index': 'Year'}, markers = True)
albania_interest_over_time
In [104]:
china_interest = pd.merge(inbound_interest[inbound_interest.name == 'China'][np.arange(2004,2020)].T.rename(columns = {2: 'inbound tourism'}), year[year['index'] == 'China'].drop('index', axis= 1).T.rename(columns = {39: 'interest over time'}), left_index=True, right_index=True)
china_interest
Out[104]:
inbound tourism interest over time
2004 67.084620 100.000000
2005 74.008540 99.545455
2006 76.869409 87.878788
2007 81.133643 82.878788
2008 79.997908 83.484848
2009 77.813188 71.818182
2010 82.295832 75.606061
2011 83.317747 73.030303
2012 81.460951 71.666667
2013 79.414045 69.696970
2014 79.057820 66.666667
2015 82.331516 69.848485
2016 87.225141 69.545455
2017 94.291796 71.969697
2018 97.580873 71.818182
2019 100.000000 72.878788
In [105]:
china_interest_over_time = px.line(china_interest, title = 'Percentage of Inbound Tourism and Interest Over Time of China', labels = {'value': 'Percentage', 'index': 'Year'}, markers = True)
china_interest_over_time

iii. Area¶

Finally, I want to test for any correlation between the area of countries and their inbound tourism, since it seems intuitive that a larger country would attract more inbound tourism.

In [106]:
trend_area = inbound[['name', 'region', 'country_alpha_3', 2019]].dropna().sort_values(by = 2019, ascending = False).reset_index(drop = True)
trend_area[2019] *= 1000000
trend_area = pd.merge(trend_area, area[['Country Code', 2019.0]], left_on =  'country_alpha_3', right_on = 'Country Code')
trend_area = trend_area.rename(columns = {'2019_x': 'inbound', '2019.0_y': 'area'})
trend_area = trend_area[(trend_area['area'] < 3000000)]
trend_area
Out[106]:
name region country_alpha_3 inbound Country Code area
0 France Europe FRA 217877000.0 FRA 547557.000
3 Spain Europe ESP 126170000.0 ESP 499570.036
4 Mexico Americas MEX 97406000.0 MEX 1943950.000
5 Italy Europe ITA 95399000.0 ITA 295717.000
6 Poland Europe POL 88515000.0 POL 306110.000
... ... ... ... ... ... ...
189 Micronesia Oceania FSM 18000.0 FSM 700.000
190 Kiribati Oceania KIR 12000.0 KIR 810.000
191 Turkmenistan Asia TKM 8200.0 TKM 469930.000
192 Marshall Islands Oceania MHL 6100.0 MHL 180.000
193 Tuvalu Oceania TUV 3700.0 TUV 30.000

188 rows × 6 columns

I chose to analyse the trend within each region separately, as this also accounts for countries' locations.
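Colouring the scatter by region and fitting one OLS trendline per region amounts to grouping before correlating. A toy sketch of the grouping step (regions and numbers are illustrative, not the real data):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['Europe'] * 3 + ['Asia'] * 3,
    'area': [1, 2, 3, 1, 2, 3],
    'inbound': [10, 20, 30, 30, 20, 10],
})

# Correlation between area and inbound tourism within each region
corr_by_region = {region: g['area'].corr(g['inbound'])
                  for region, g in df.groupby('region')}
print(corr_by_region)
```

A strong trend inside one region can be invisible, or even reversed, in the pooled data, which is why the per-region trendlines are worth plotting.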

In [107]:
inbound_area = px.scatter(trend_area, x = 'area', y = 'inbound', color = 'region', opacity=0.8, trendline='ols', title = 'The Correlation between Inbound Tourism and Area', labels = {'inbound': 'Inbound Tourism', 'area': 'Area'}).update_traces(marker=dict(size=5, line=dict(width=1, color='DarkSlateGrey')), selector=dict(mode='markers')).update_layout(legend_title = 'Region')
inbound_area
In [108]:
inbound_area_europe = trend_area[trend_area['region'] == 'Europe'].copy()
inbound_area_europe
Out[108]:
name region country_alpha_3 inbound Country Code area
0 France Europe FRA 217877000.0 FRA 547557.000
3 Spain Europe ESP 126170000.0 ESP 499570.036
5 Italy Europe ITA 95399000.0 ITA 295717.000
6 Poland Europe POL 88515000.0 POL 306110.000
7 Hungary Europe HUN 61397000.0 HUN 91260.000
8 Croatia Europe HRV 60021000.0 HRV 55964.000
11 United Kingdom Europe GBR 40857000.0 GBR 241930.000
13 Germany Europe DEU 39563000.0 DEU 349390.000
14 Czech Republic Europe CZE 37202000.0 CZE 77205.500
16 Greece Europe GRC 34005000.0 GRC 128900.000
17 Denmark Europe DNK 33093000.0 DNK 40000.000
19 Austria Europe AUT 31884000.0 AUT 82520.000
24 Netherlands Europe NLD 20129000.0 NLD 33670.000
29 Portugal Europe PRT 17283000.0 PRT 91605.600
31 Slovakia Europe SVK 16086000.0 SVK 48080.000
33 Ukraine Europe UKR 13710000.0 UKR 579400.000
36 Romania Europe ROU 12815000.0 ROU 230080.000
37 Bulgaria Europe BGR 12552000.0 BGR 108560.000
38 Ireland Europe IRL 12401000.0 IRL 68890.000
39 Belarus Europe BLR 11832000.0 BLR 202965.000
40 Switzerland Europe CHE 11818000.0 CHE 39516.030
45 Belgium Europe BEL 9343000.0 BEL 30280.000
51 Latvia Europe LVA 8342000.0 LVA 62227.000
52 Andorra Europe AND 8235000.0 AND 470.000
54 Sweden Europe SWE 7616000.0 SWE 407283.590
59 Albania Europe ALB 6406000.0 ALB 27400.000
61 Lithuania Europe LTU 6150000.0 LTU 62620.000
62 Estonia Europe EST 6103000.0 EST 42750.000
63 Norway Europe NOR 5879000.0 NOR 365094.000
70 Slovenia Europe SVN 4702000.0 SVN 20136.400
80 Malta Europe MLT 3519000.0 MLT 320.000
84 Finland Europe FIN 3290000.0 FIN 303930.000
88 Montenegro Europe MNE 2510000.0 MNE 13450.000
95 Iceland Europe ISL 2202000.0 ISL 100830.000
103 San Marino Europe SMR 1904000.0 SMR 60.000
105 Republic of Serbia Europe SRB 1847000.0 SRB 87460.000
122 Bosnia and Herzegovina Europe BIH 1198000.0 BIH 51200.000
126 Luxembourg Europe LUX 1041000.0 LUX 2574.460
142 Macedonia Europe MKD 758000.0 MKD 25220.000
146 Monaco Europe MCO 545000.0 MCO 2.027
170 Moldova Europe MDA 174000.0 MDA 32885.900
175 Liechtenstein Europe LIE 98100.0 LIE 160.000
In [109]:
sns.residplot(x = inbound_area_europe['area'], y = inbound_area_europe['inbound'])
plt.show()
In [110]:
train_idx, test_idx = train_test_split(inbound_area_europe.index, test_size=.25, random_state=0)
inbound_area_europe['split'] = 'train'
inbound_area_europe.loc[test_idx, 'split'] = 'test'

X = inbound_area_europe[['area']]
X_train = inbound_area_europe.loc[train_idx, ['area']]
y_train = inbound_area_europe.loc[train_idx, 'inbound']

model = LinearRegression()
model.fit(X_train, y_train)
inbound_area_europe['prediction'] = model.predict(X)

inbound_area_europe['residual'] = inbound_area_europe['prediction'] - inbound_area_europe['inbound']

inbound_area_europe_residplot = px.scatter(
    inbound_area_europe, x='prediction', y='residual',
    marginal_y='violin',
    color='split', trendline='ols'
)
inbound_area_europe_residplot
In [111]:
europe_area_coef, europe_area_value = stats.pearsonr(trend_area['area'], trend_area['inbound'])
europe_area_coef, europe_area_value
Out[111]:
(0.127905459662025, 0.0802483560283787)

Results Findings & Conclusion¶

For each research question, summarize in 2-3 visualizations which will answer the question. Interpret the results accordingly and give your observation and conclusion. The visualizations should be well presented (apply what you have learnt in Chapter 9 on data communication). The plots shown here could be an enhanced version of the EDA plots, or presented in another format.

1. Which countries have the highest number of outbound tourism, inbound tourism, and domestic tourism?¶

i. Outbound tourism (departures)¶

In [112]:
outbound_map
Out[112]:
(Interactive choropleth map of outbound tourism; trust the notebook to render it.)
In [113]:
outbound_vbar
In [114]:
outbound_hbar

As we can see, the United States of America has the highest outbound tourism, followed by Mexico and Germany.

ii. Inbound tourism (arrivals)¶

In [115]:
inbound_map
Out[115]:
(Interactive choropleth map of inbound tourism; trust the notebook to render it.)
In [116]:
inbound_vbar
In [117]:
inbound_hbar

As we can see, France has the highest inbound tourism, followed by the United States of America and China.

iii. Domestic tourism (trips)¶

In [118]:
domestic_map
Out[118]:
(Interactive choropleth map of domestic tourism; trust the notebook to render it.)
In [119]:
domestic_vbar
In [120]:
domestic_hbar

As we can see, China has the highest domestic tourism, followed by the United States of America and India.

In conclusion, the United States of America, France, and China have the highest outbound, inbound, and domestic tourism respectively.

2. What is the country with the most visits (e.g. sum of inbound tourism and domestic tourism)?¶

In [121]:
arrivals_map
Out[121]:
(Interactive choropleth map of total arrivals; trust the notebook to render it.)
In [122]:
arrivals_bar
In [123]:
arrivals_stacked_bar

In conclusion, China is the most visited country overall.

3. What are the countries with the most tourism in every region?¶

i. Outbound tourism (departures)¶

In [124]:
outbound_tree

Outbound Tourism:
Africa: Namibia
Americas: United States of America
Asia: China
Europe: Germany
Oceania: Australia

ii. Inbound tourism (arrivals)¶

In [125]:
inbound_tree

Inbound Tourism:
Africa: Namibia
Americas: United States of America
Asia: China
Europe: France
Oceania: Australia

iii. Domestic tourism (trips)¶

In [126]:
domestic_tree

Domestic Tourism:
Africa: South Africa
Americas: United States of America
Asia: China
Europe: United Kingdom
Oceania: Australia

iv. Arrivals¶

In [127]:
arrivals_tree

Arrivals:
Africa: South Africa
Americas: United States of America
Asia: China
Europe: United Kingdom
Oceania: Australia

v. Total¶

In [128]:
total_tree
In [129]:
total_group
In [130]:
total_box

Total:
Africa: Namibia
Americas: United States of America
Asia: China
Europe: United Kingdom
Oceania: Australia

Outbound Tourism: Africa: Namibia, Americas: United States of America, Asia: China, Europe: Germany, Oceania: Australia
Inbound Tourism: Africa: Namibia, Americas: United States of America, Asia: China, Europe: France, Oceania: Australia
Domestic Tourism: Africa: South Africa, Americas: United States of America, Asia: China, Europe: United Kingdom, Oceania: Australia
Arrivals: Africa: South Africa, Americas: United States of America, Asia: China, Europe: United Kingdom, Oceania: Australia
Total: Africa: Namibia, Americas: United States of America, Asia: China, Europe: United Kingdom, Oceania: Australia

4. Which country has the most tourism overall (e.g. sum of outbound tourism, inbound tourism, and domestic tourism)?¶

In [131]:
total_bar
In [132]:
total_stacked_vbar
In [133]:
total_stacked_hbar

In conclusion, China has the most tourism overall.

5. How did tourism grow across various countries?¶

i. Outbound tourism (departures)¶

In [134]:
outbound_growth

Let us focus on the United States of America.

In [135]:
outbound_growth.update_traces({'line': {'color': 'lightgrey'}}).update_traces(patch = {'line': {'color': 'blue', 'width': 2}}, selector={'legendgroup': 'United States of America'}).add_annotation(ax = 2005, ay = 79215000 ,axref = "x", ayref='y', x = 2006, y = 148511000, text = 'Surge', showarrow=True, xshift = -30, yshift = 15, arrowhead = 4, arrowwidth = 2, font_size = 20)

We can see a sudden surge in outbound tourism for the United States of America in 2006.

ii. Inbound tourism (arrivals)¶

In [136]:
inbound_growth

Let us focus on France.

In [137]:
inbound_growth.update_traces({'line': {'color': 'lightgrey'}}).update_traces(patch = {'line': {'color': 'blue', 'width': 2}}, selector={'legendgroup': 'France'}).add_annotation(ax = 1997, ay = 157551000 ,axref = "x", ayref='y', x = 1998, y = 70109000, text = 'Drop', showarrow=True, xshift = 10, yshift = 15, arrowhead = 4, arrowwidth = 2, font_size = 20).add_annotation(ax = 2003, ay = 75048000 ,axref = "x", ayref='y', x = 2004, y = 190282000, text = 'Surge', showarrow=True, xshift = -30, yshift = 15, arrowhead = 4, arrowwidth = 2, font_size = 20)

We can see that, apart from a dip between 1998 and 2003, France's inbound tourism is consistently higher than that of other countries.

iii. Domestic tourism (trips)¶

In [138]:
domestic_growth

Let us focus on China.

In [139]:
domestic_growth.update_traces({'line': {'color': 'lightgrey'}}).update_traces(patch = {'line': {'color': 'blue', 'width': 2}}, selector={'legendgroup': 'China'}).add_annotation(ax = 1997, ay = 644000000 ,axref = "x", ayref='y', x = 2019, y = 6005852000, text = 'Steady Increase', showarrow=True, xshift = -30, yshift = 15, arrowhead = 4, arrowwidth = 2, font_size = 20)

We can see that China's domestic tourism increases steadily and eventually overtakes that of the United States of America.

6. Are there any correlations between tourism of countries / regions and other variables (heritage, interest over time, etc.)?¶

i. Rankings of Countries¶

Here is a pairplot of the ranking of countries based on different aspects against the ranking of inbound tourism.

In [140]:
inbound_ranking
Out[140]:

From the pairplot, we can see that Heritage and Cultural Influence correlate most strongly with Inbound Tourism.

Firstly, let us take a closer look at heritage.

In [141]:
inbound_heritage_contour

The densest cluster of countries lies where the heritage ranking is around 9.5 and the inbound tourism ranking is around 24.5.

In [142]:
inbound_heritage.add_annotation(x = 25, y = 120, text = 'Correlation Coefficient = ' + str(heritage_coef) + '<br>' + 'P-Value = ' +  str(heritage_value), showarrow = False, font_size = 20)

Since the correlation coefficient is positive, there is a moderate positive relationship between the ranking of inbound tourism and the ranking of heritage. Moreover, the p-value is far below 0.001, so the correlation is statistically significant.
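For reference, the Pearson coefficient reported by `stats.pearsonr` throughout this section is the covariance of the two variables normalised by the product of their standard deviations. A plain-Python sketch of the formula:

```python
import math

def pearson_r(xs, ys):
    # r = cov(x, y) / (std(x) * std(y)), so r always lies in [-1, 1]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # ~1.0: perfectly correlated
```

`stats.pearsonr` additionally returns the p-value, i.e. the probability of observing a correlation at least this strong if the two variables were actually uncorrelated.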

Moving on to cultural influence.

In [143]:
inbound_culture_contour

The densest cluster of countries lies where the cultural influence ranking is around 9.5 and the inbound tourism ranking is around 24.5.

In [144]:
inbound_culture.add_annotation(x = 25, y = 110, text = 'Correlation Coefficient = ' + str(culture_coef) + '<br>' + 'P-Value = ' +  str(culture_value), showarrow = False, font_size = 20)

As the correlation coefficient is lower, the positive relationship between the ranking of inbound tourism and the ranking of cultural influence is weaker. The p-value is still below 0.001, so the result remains statistically significant.

ii. Interest over time¶

Here is a small multiples plot of inbound tourism and interest over time.

In [145]:
interest
Out[145]:

There is no clear correlation.

Let us look at France, for instance.

In [146]:
france_interest_over_time

Inbound tourism increases slightly while interest over time decreases. Inbound tourism peaks at the end of the period, whereas interest over time peaks at the start.

Next, let us look at Albania.

In [147]:
albania_interest_over_time

Inbound tourism increases slightly while interest over time increases more rapidly. Both reach their peak at the end of the period.

Finally, let us look at China.

In [148]:
china_interest_over_time

Inbound tourism increases while interest over time decreases. Inbound tourism peaks at the end of the period, whereas interest over time peaks at the start.

It is clear that there is no significant correlation between inbound tourism and interest over time.

iii. Area¶

A scatterplot of inbound tourism against area is plotted with the points being grouped by region.

In [149]:
inbound_area.add_annotation(x = 600000, y = 100000000, text = 'Europe Correlation Coefficient = ' + str(europe_area_coef) + '<br>' + 'Europe P-Value = ' +  str(europe_area_value), showarrow = False, font_size = 10)

Regression lines are plotted for each region as there is no clear trend overall.

Let us focus on Europe, for example.

In [150]:
inbound_area_europe_residplot

We can see that the residual plot has a fan shape: the residuals spread out as the prediction grows, so predictions of inbound tourism become less reliable for larger areas (heteroscedasticity).
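A fan-shaped residual plot is the classic signature of heteroscedastic noise, where the error variance grows with the predictor. A synthetic sketch (not the tourism data) that compares residual spread between the small-x and large-x halves of a fit:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1.0, 100.0, 200)
# Noise whose scale grows with x produces a "fan" of residuals
y = 2.0 * x + rng.normal(0.0, 0.5 * x)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# The residuals fan out: spread in the upper half of x exceeds the lower half
low_spread = residuals[x < 50].std()
high_spread = residuals[x >= 50].std()
print(low_spread, high_spread)
```

When this pattern appears, a single homoscedastic linear fit understates the uncertainty for large predictor values, which matches what the residual plot above shows for large-area countries.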

In conclusion, heritage and cultural influence correlate moderately with inbound tourism, interest over time shows no clear relationship, and area shows only a weak correlation that is not statistically significant.

Recommendations or Further Works¶

State any recommendations, improvements or further works.

  - It would be useful to analyse origin-destination tourism data (where tourists depart from and where they arrive), but more data would need to be collected in this area.
  - It would also be useful to record the purpose of each trip, so that the analysis could focus on leisure travel.
  - For future work, we could analyse how COVID-19 affected tourism and track the recovery of various countries.
  - Furthermore, more correlations could be examined and researched to determine whether causation exists.

References¶

Cite any references made, and links where you obtained the data. You may wish to read about how to use markdown in Jupyter notebook to make your report easier to read. https://www.ibm.com/docs/en/db2-event-store/2.0.0?topic=notebooks-markdown-jupyter-cheatsheet
  1. https://wordpress.com/support/markdown-quick-reference/
  2. https://www.worlddata.info/tourism.php
  3. https://www.unwto.org/tourism-statistics-database
  4. https://www.unwto.org/tourism-data/unwto-tourism-dashboard
  5. https://www.unwto.org/tourism-data/international-tourism-and-covid-19
  6. https://www.unwto.org/tourism-statistics/key-tourism-statistics
  7. https://www.singstat.gov.sg/find-data/search-by-theme/industry/tourism/latest-data
  8. https://www.macrotrends.net/countries/USA/united-states/tourism-statistics
  9. https://stats.oecd.org/index.aspx?DataSetCode=TOURISM_DOMESTIC
  10. https://towardsdatascience.com/a-very-precise-fast-way-to-pull-google-trends-data-automatically-4c3c431960aa
  11. https://data.world/datasets/tourism
  12. https://gis.stackexchange.com/questions/108811/poi-database-of-tourist-attractions-or-poi-database-with-popularity-index
  13. https://www.geonames.org/
  14. https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes
  15. https://github.com/pat310/google-trends-api/wiki/Google-Trends-Categories
  16. https://www.iso.org/obp/ui/#search